[RFC PATCH 0/5] EFI Special Purpose Memory Support

2019-04-04T19:21:09Z

The EFI 2.8 Specification [1] introduces the EFI_MEMORY_SP ("special
purpose") memory attribute. This attribute bit replaces the deprecated
"reservation hint" that was introduced in ACPI 6.2 and removed in ACPI
6.3.

Given the increasing diversity of memory types that might be advertised
to the operating system, there is a need for platform firmware to hint
which memory ranges are free for the OS to use as general purpose memory
and which ranges are intended for application specific usage. For
example, an application with prior knowledge of the platform may expect
to be able to exclusively allocate a precious / limited pool of high
bandwidth memory. Alternatively, for the general purpose case, the
operating system may want to make the memory available on a best effort
basis as a unique numa-node with performance properties by the new
CONFIG_HMEM_REPORTING [2] facility.

In support of allowing for both exclusive and core-kernel-mm managed
access to differentiated memory, claim EFI_MEMORY_SP ranges for exposure
as device-dax instances by default. Those instances can be directly
owned / mapped by a platform-topology-aware application. However, with
the new kmem facility [3], the administrator has the option to instead
designate that those memory ranges be hot-added to the core-kernel-mm as
a unique memory numa-node. In short, allow for the decision about what
software agent manages special purpose memory to be made at runtime.

The patches are based on v8 of Keith's "HMEM" series currently in Greg's
driver-core-testing branch [4], and have not been tested. This is an RFC
proposal on how to handle the new EFI memory attribute.

[1]: https://uefi.org/sites/default/files/resources/UEFI_Spec_2_8_final.pdf
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git/commit/?h=driver-core-testing&id=b6efba75c449
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308f
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git/log/?h=driver-core-testing

---

Dan Williams (5):
      efi: Detect UEFI 2.8 Special Purpose Memory
      lib/memregion: Uplevel the pmem "region" ida to a global allocator
      acpi/hmat: Track target address ranges
      acpi/hmat: Register special purpose memory as a device
      device-dax: Add a driver for "hmem" devices


 arch/x86/Kconfig                  |   18 +++++
 arch/x86/boot/compressed/eboot.c  |    5 +
 arch/x86/boot/compressed/kaslr.c  |    2 -
 arch/x86/include/asm/e820/types.h |    9 ++
 arch/x86/kernel/e820.c            |    9 ++
 arch/x86/platform/efi/efi.c       |   10 ++-
 drivers/acpi/hmat/Kconfig         |    1 
 drivers/acpi/hmat/hmat.c          |  140 +++++++++++++++++++++++++++++++------
 drivers/dax/Kconfig               |   26 ++++++-
 drivers/dax/Makefile              |    2 +
 drivers/dax/hmem.c                |   58 +++++++++++++++
 drivers/nvdimm/Kconfig            |    1 
 drivers/nvdimm/core.c             |    1 
 drivers/nvdimm/nd-core.h          |    1 
 drivers/nvdimm/region_devs.c      |   13 +--
 include/linux/efi.h               |   14 ++++
 include/linux/ioport.h            |    1 
 include/linux/memregion.h         |    9 ++
 lib/Kconfig                       |    6 ++
 lib/Makefile                      |    1 
 lib/memregion.c                   |   22 ++++++
 21 files changed, 304 insertions(+), 45 deletions(-)
 create mode 100644 drivers/dax/hmem.c
 create mode 100644 include/linux/memregion.h
 create mode 100644 lib/memregion.c

[RFC PATCH 1/5] efi: Detect UEFI 2.8 Special Purpose Memory

2019-04-04T19:21:15Z

UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
interpretation of the EFI Memory Types as "reserved for a special
purpose".

The proposed Linux behavior for special purpose memory is that it is
reserved for direct-access (device-dax) by default and not available for
any kernel usage, not even as an OOM fallback. Later, through udev
scripts or another init mechanism, these device-dax claimed ranges can
be reconfigured and hot-added to the available System-RAM with a unique
node identifier.

A follow-on patch integrates parsing of the ACPI HMAT to identify the
node and sub-range boundaries of EFI_MEMORY_SP designated memory. For
now, arrange for EFI_MEMORY_SP memory to be reserved.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ard Biesheuvel 
Cc: Darren Hart 
Cc: Andy Shevchenko 
Signed-off-by: Dan Williams 
---
 arch/x86/Kconfig                  |   18 ++++++++++++++++++
 arch/x86/boot/compressed/eboot.c  |    5 ++++-
 arch/x86/boot/compressed/kaslr.c  |    2 +-
 arch/x86/include/asm/e820/types.h |    9 +++++++++
 arch/x86/kernel/e820.c            |    9 +++++++--
 arch/x86/platform/efi/efi.c       |   10 +++++++++-
 include/linux/efi.h               |   14 ++++++++++++++
 include/linux/ioport.h            |    1 +
 8 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c1f9b3cf437c..cb9ca27de7a5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1961,6 +1961,24 @@ config EFI_MIXED
 
 	   If unsure, say N.
 
+config EFI_SPECIAL_MEMORY
+	bool "EFI Special Purpose Memory Support"
+	depends on EFI
+	---help---
+	  On systems that have mixed performance classes of memory EFI
+	  may indicate special purpose memory with an attribute (See
+	  EFI_MEMORY_SP in UEFI 2.8). A memory range tagged with this
+	  attribute may have unique performance characteristics compared
+	  to the system's general purpose "System RAM" pool. On the
+	  expectation that such memory has application specific usage
+	  answer Y to arrange for the kernel to reserve it for
+	  direct-access (device-dax) by default. The memory range can
+	  later be optionally assigned to the page allocator by system
+	  administrator policy. Say N to have the kernel treat this
+	  memory as general purpose by default.
+
+	  If unsure, say Y.
+
 config SECCOMP
 	def_bool y
 	prompt "Enable seccomp to safely compute untrusted bytecode"
diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
index 544ac4fafd11..9b90fae21abe 100644
--- a/arch/x86/boot/compressed/eboot.c
+++ b/arch/x86/boot/compressed/eboot.c
@@ -560,7 +560,10 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
 		case EFI_BOOT_SERVICES_CODE:
 		case EFI_BOOT_SERVICES_DATA:
 		case EFI_CONVENTIONAL_MEMORY:
-			e820_type = E820_TYPE_RAM;
+			if (is_efi_special(d))
+				e820_type = E820_TYPE_SPECIAL;
+			else
+				e820_type = E820_TYPE_RAM;
 			break;
 
 		case EFI_ACPI_MEMORY_NVS:
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 2e53c056ba20..897e46eb9714 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -757,7 +757,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
 		 *
 		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
 		 */
-		if (md->type != EFI_CONVENTIONAL_MEMORY)
+		if (md->type != EFI_CONVENTIONAL_MEMORY || is_efi_special(md))
 			continue;
 
 		if (efi_mirror_found &&
diff --git a/arch/x86/include/asm/e820/types.h b/arch/x86/include/asm/e820/types.h
index c3aa4b5e49e2..0ab8abae2e8b 100644
--- a/arch/x86/include/asm/e820/types.h
+++ b/arch/x86/include/asm/e820/types.h
@@ -28,6 +28,15 @@ enum e820_type {
 	 */
 	E820_TYPE_PRAM		= 12,
 
+	/*
+	 * Special-purpose / application-specific memory is indicated to
+	 * the system via the EFI_MEMORY_SP attribute. Define an e820
+	 * translation of this memory type for the purpose of
+	 * reserving this range and marking it with the
+	 * IORES_DESC_APPLICATION_RESERVED designation.
+	 */
+	E820_TYPE_SPECIAL	= 0xefffffff,
+
 	/*
 	 * Reserved RAM used by the kernel itself if
 	 * CONFIG_INTEL_TXT=y is enabled, memory of this type
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 2879e234e193..9f50dd0bbb04 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -176,6 +176,7 @@ static void __init e820_print_type(enum e820_type type)
 	switch (type) {
 	case E820_TYPE_RAM:		/* Fall through: */
 	case E820_TYPE_RESERVED_KERN:	pr_cont("usable");			break;
+	case E820_TYPE_SPECIAL:		/* Fall through: */
 	case E820_TYPE_RESERVED:	pr_cont("reserved");			break;
 	case E820_TYPE_ACPI:		pr_cont("ACPI data");			break;
 	case E820_TYPE_NVS:		pr_cont("ACPI NVS");			break;
@@ -1023,6 +1024,7 @@ static const char *__init e820_type_to_string(struct e820_entry *entry)
 	case E820_TYPE_UNUSABLE:	return "Unusable memory";
 	case E820_TYPE_PRAM:		return "Persistent Memory (legacy)";
 	case E820_TYPE_PMEM:		return "Persistent Memory";
+	case E820_TYPE_SPECIAL:		/* Fall-through: */
 	case E820_TYPE_RESERVED:	return "Reserved";
 	default:			return "Unknown E820 type";
 	}
@@ -1038,6 +1040,7 @@ static unsigned long __init e820_type_to_iomem_type(struct e820_entry *entry)
 	case E820_TYPE_UNUSABLE:	/* Fall-through: */
 	case E820_TYPE_PRAM:		/* Fall-through: */
 	case E820_TYPE_PMEM:		/* Fall-through: */
+	case E820_TYPE_SPECIAL:		/* Fall-through: */
 	case E820_TYPE_RESERVED:	/* Fall-through: */
 	default:			return IORESOURCE_MEM;
 	}
@@ -1050,6 +1053,7 @@ static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
 	case E820_TYPE_NVS:		return IORES_DESC_ACPI_NV_STORAGE;
 	case E820_TYPE_PMEM:		return IORES_DESC_PERSISTENT_MEMORY;
 	case E820_TYPE_PRAM:		return IORES_DESC_PERSISTENT_MEMORY_LEGACY;
+	case E820_TYPE_SPECIAL:		return IORES_DESC_APPLICATION_RESERVED;
 	case E820_TYPE_RESERVED_KERN:	/* Fall-through: */
 	case E820_TYPE_RAM:		/* Fall-through: */
 	case E820_TYPE_UNUSABLE:	/* Fall-through: */
@@ -1065,13 +1069,14 @@ static bool __init do_mark_busy(enum e820_type type, struct resource *res)
 		return true;
 
 	/*
-	 * Treat persistent memory like device memory, i.e. reserve it
-	 * for exclusive use of a driver
+	 * Treat persistent memory and other special memory ranges like
+	 * device memory, i.e. reserve it for exclusive use of a driver
 	 */
 	switch (type) {
 	case E820_TYPE_RESERVED:
 	case E820_TYPE_PRAM:
 	case E820_TYPE_PMEM:
+	case E820_TYPE_SPECIAL:
 		return false;
 	case E820_TYPE_RESERVED_KERN:
 	case E820_TYPE_RAM:
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index e1cb01a22fa8..d227751f331b 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -139,7 +139,9 @@ static void __init do_add_efi_memmap(void)
 		case EFI_BOOT_SERVICES_CODE:
 		case EFI_BOOT_SERVICES_DATA:
 		case EFI_CONVENTIONAL_MEMORY:
-			if (md->attribute & EFI_MEMORY_WB)
+			if (is_efi_special(md))
+				e820_type = E820_TYPE_SPECIAL;
+			else if (md->attribute & EFI_MEMORY_WB)
 				e820_type = E820_TYPE_RAM;
 			else
 				e820_type = E820_TYPE_RESERVED;
@@ -753,6 +755,12 @@ static bool should_map_region(efi_memory_desc_t *md)
 	if (IS_ENABLED(CONFIG_X86_32))
 		return false;
 
+	/*
+	 * Special purpose memory is not mapped by default.
+	 */
+	if (is_efi_special(md))
+		return false;
+
 	/*
 	 * Map all of RAM so that we can access arguments in the 1:1
 	 * mapping when making EFI runtime calls.
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 54357a258b35..cecbc2bda1da 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -112,6 +112,7 @@ typedef	struct {
 #define EFI_MEMORY_MORE_RELIABLE \
 				((u64)0x0000000000010000ULL)	/* higher reliability */
 #define EFI_MEMORY_RO		((u64)0x0000000000020000ULL)	/* read-only */
+#define EFI_MEMORY_SP		((u64)0x0000000000040000ULL)	/* special purpose */
 #define EFI_MEMORY_RUNTIME	((u64)0x8000000000000000ULL)	/* range requires runtime mapping */
 #define EFI_MEMORY_DESCRIPTOR_VERSION	1
 
@@ -128,6 +129,19 @@ typedef struct {
 	u64 attribute;
 } efi_memory_desc_t;
 
+#ifdef CONFIG_EFI_SPECIAL_MEMORY
+static inline bool is_efi_special(efi_memory_desc_t *md)
+{
+	return md->type == EFI_CONVENTIONAL_MEMORY
+		&& (md->attribute & EFI_MEMORY_SP);
+}
+#else
+static inline bool is_efi_special(efi_memory_desc_t *md)
+{
+	return false;
+}
+#endif
+
 typedef struct {
 	efi_guid_t guid;
 	u32 headersize;
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index da0ebaec25f0..2d79841ee9b9 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -133,6 +133,7 @@ enum {
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
 	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
 	IORES_DESC_DEVICE_PUBLIC_MEMORY		= 7,
+	IORES_DESC_APPLICATION_RESERVED		= 8,
 };
 
 /* helpers to define resources */

[RFC PATCH 2/5] lib/memregion: Uplevel the pmem "region" ida to a global allocator

2019-04-04T19:21:20Z

In preparation for handling platform differentiated memory types beyond
persistent memory, uplevel the "region" identifier to a global number
space. This enables a device-dax instance to be registered to any memory
type with guaranteed unique names.

Cc: Keith Busch 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/Kconfig       |    1 +
 drivers/nvdimm/core.c        |    1 -
 drivers/nvdimm/nd-core.h     |    1 -
 drivers/nvdimm/region_devs.c |   13 ++++---------
 include/linux/memregion.h    |    6 ++++++
 lib/Kconfig                  |    6 ++++++
 lib/Makefile                 |    1 +
 lib/memregion.c              |   22 ++++++++++++++++++++++
 8 files changed, 40 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/memregion.h
 create mode 100644 lib/memregion.c

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 5e27918e4624..bf26cc5f6d67 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -3,6 +3,7 @@ menuconfig LIBNVDIMM
 	depends on PHYS_ADDR_T_64BIT
 	depends on HAS_IOMEM
 	depends on BLK_DEV
+	select MEMREGION
 	help
 	  Generic support for non-volatile memory devices including
 	  ACPI-6-NFIT defined resources.  On platforms that define an
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index acce050856a8..75fe651d327d 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -463,7 +463,6 @@ static __exit void libnvdimm_exit(void)
 	nd_region_exit();
 	nvdimm_exit();
 	nvdimm_bus_exit();
-	nd_region_devs_exit();
 	nvdimm_devs_exit();
 }
 
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index e5ffd5733540..17561302dfec 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -133,7 +133,6 @@ struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
 int __init nvdimm_bus_init(void);
 void nvdimm_bus_exit(void);
 void nvdimm_devs_exit(void);
-void nd_region_devs_exit(void);
 void nd_region_probe_success(struct nvdimm_bus *nvdimm_bus, struct device *dev);
 struct nd_region;
 void nd_region_create_ns_seed(struct nd_region *nd_region);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b4ef7d9ff22e..eefdfd2565dd 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -11,6 +11,7 @@
  * General Public License for more details.
  */
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -27,7 +28,6 @@
  */
 #include 
 
-static DEFINE_IDA(region_ida);
 static DEFINE_PER_CPU(int, flush_idx);
 
 static int nvdimm_map_flush(struct device *dev, struct nvdimm *nvdimm, int dimm,
@@ -141,7 +141,7 @@ static void nd_region_release(struct device *dev)
 		put_device(&nvdimm->dev);
 	}
 	free_percpu(nd_region->lane);
-	ida_simple_remove(&region_ida, nd_region->id);
+	memregion_free(nd_region->id);
 	if (is_nd_blk(dev))
 		kfree(to_nd_blk_region(dev));
 	else
@@ -1036,7 +1036,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 
 	if (!region_buf)
 		return NULL;
-	nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
+	nd_region->id = memregion_alloc();
 	if (nd_region->id < 0)
 		goto err_id;
 
@@ -1090,7 +1090,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	return nd_region;
 
  err_percpu:
-	ida_simple_remove(&region_ida, nd_region->id);
+	memregion_free(nd_region->id);
  err_id:
 	kfree(region_buf);
 	return NULL;
@@ -1237,8 +1237,3 @@ int nd_region_conflict(struct nd_region *nd_region, resource_size_t start,
 
 	return device_for_each_child(&nvdimm_bus->dev, &ctx, region_conflict);
 }
-
-void __exit nd_region_devs_exit(void)
-{
-	ida_destroy(&region_ida);
-}
diff --git a/include/linux/memregion.h b/include/linux/memregion.h
new file mode 100644
index 000000000000..99fa47793b49
--- /dev/null
+++ b/include/linux/memregion.h
@@ -0,0 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _MEMREGION_H_
+#define _MEMREGION_H_
+int memregion_alloc(void);
+void memregion_free(int id);
+#endif /* _MEMREGION_H_ */
diff --git a/lib/Kconfig b/lib/Kconfig
index a9e56539bd11..0b765d9a1145 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -318,6 +318,12 @@ config DECOMPRESS_LZ4
 config GENERIC_ALLOCATOR
 	bool
 
+#
+# Generic IDA for memory regions
+#
+config MEMREGION
+	bool
+
 #
 # reed solomon support is select'ed if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index 3b08673e8881..6e237c4071af 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -122,6 +122,7 @@ obj-$(CONFIG_LIBCRC32C)	+= libcrc32c.o
 obj-$(CONFIG_CRC8)	+= crc8.o
 obj-$(CONFIG_XXHASH)	+= xxhash.o
 obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
+obj-$(CONFIG_MEMREGION) += memregion.o
 
 obj-$(CONFIG_842_COMPRESS) += 842/
 obj-$(CONFIG_842_DECOMPRESS) += 842/
diff --git a/lib/memregion.c b/lib/memregion.c
new file mode 100644
index 000000000000..21f5ff96c2eb
--- /dev/null
+++ b/lib/memregion.c
@@ -0,0 +1,22 @@
+#include 
+#include 
+
+static DEFINE_IDA(region_ida);
+
+int memregion_alloc(void)
+{
+	return ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
+}
+EXPORT_SYMBOL(memregion_alloc);
+
+void memregion_free(int id)
+{
+	ida_simple_remove(&region_ida, id);
+}
+EXPORT_SYMBOL(memregion_free);
+
+static void __exit memregion_exit(void)
+{
+	ida_destroy(&region_ida);
+}
+module_exit(memregion_exit);

[RFC PATCH 3/5] acpi/hmat: Track target address ranges

2019-04-04T19:21:57Z

As of ACPI 6.3 the HMAT no longer advertises the physical memory address
range for its entries. Instead, the expectation is the corresponding
entry in the SRAT is looked up by the target proximity domain.

Given there may be multiple distinct address ranges that share the same
performance profile (sparse address space), find_mem_target() is updated
to also consider the start address of the memory range. Target property
updates are also adjusted to loop over all possible 'struct target'
instances that may share the same proximity domain identification.

Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Keith Busch 
Cc: Jonathan Cameron 
Signed-off-by: Dan Williams 
---
 drivers/acpi/hmat/hmat.c |   77 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index b275016ff648..e7ae44c8d359 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -38,6 +38,7 @@ static struct memory_locality *localities_types[4];
 
 struct memory_target {
 	struct list_head node;
+	u64 start, size;
 	unsigned int memory_pxm;
 	unsigned int processor_pxm;
 	struct node_hmem_attrs hmem_attrs;
@@ -63,12 +64,13 @@ static __init struct memory_initiator *find_mem_initiator(unsigned int cpu_pxm)
 	return NULL;
 }
 
-static __init struct memory_target *find_mem_target(unsigned int mem_pxm)
+static __init struct memory_target *find_mem_target(unsigned int mem_pxm,
+		u64 start)
 {
 	struct memory_target *target;
 
 	list_for_each_entry(target, &targets, node)
-		if (target->memory_pxm == mem_pxm)
+		if (target->memory_pxm == mem_pxm && target->start == start)
 			return target;
 	return NULL;
 }
@@ -92,14 +94,15 @@ static __init void alloc_memory_initiator(unsigned int cpu_pxm)
 	list_add_tail(&initiator->node, &initiators);
 }
 
-static __init void alloc_memory_target(unsigned int mem_pxm)
+static __init void alloc_memory_target(unsigned int mem_pxm,
+		u64 start, u64 size)
 {
 	struct memory_target *target;
 
 	if (pxm_to_node(mem_pxm) == NUMA_NO_NODE)
 		return;
 
-	target = find_mem_target(mem_pxm);
+	target = find_mem_target(mem_pxm, start);
 	if (target)
 		return;
 
@@ -109,6 +112,8 @@ static __init void alloc_memory_target(unsigned int mem_pxm)
 
 	target->memory_pxm = mem_pxm;
 	target->processor_pxm = PXM_INVAL;
+	target->start = start;
+	target->size = size;
 	list_add_tail(&target->node, &targets);
 }
 
@@ -183,8 +188,8 @@ static __init u32 hmat_normalize(u16 entry, u64 base, u8 type)
 	return value;
 }
 
-static __init void hmat_update_target_access(struct memory_target *target,
-					     u8 type, u32 value)
+static __init void __hmat_update_target_access(struct memory_target *target,
+		u8 type, u32 value)
 {
 	switch (type) {
 	case ACPI_HMAT_ACCESS_LATENCY:
@@ -212,6 +217,20 @@ static __init void hmat_update_target_access(struct memory_target *target,
 	}
 }
 
+static __init void hmat_update_target_access(int memory_pxm, int processor_pxm,
+		u8 type, u32 value)
+{
+	struct memory_target *target;
+
+	list_for_each_entry(target, &targets, node) {
+		if (target->processor_pxm != processor_pxm)
+			continue;
+		if (target->memory_pxm != memory_pxm)
+			continue;
+		__hmat_update_target_access(target, type, value);
+	}
+}
+
 static __init void hmat_add_locality(struct acpi_hmat_locality *hmat_loc)
 {
 	struct memory_locality *loc;
@@ -255,7 +274,6 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
 				      const unsigned long end)
 {
 	struct acpi_hmat_locality *hmat_loc = (void *)header;
-	struct memory_target *target;
 	unsigned int init, targ, total_size, ipds, tpds;
 	u32 *inits, *targs, value;
 	u16 *entries;
@@ -296,11 +314,9 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
 				inits[init], targs[targ], value,
 				hmat_data_type_suffix(type));
 
-			if (mem_hier == ACPI_HMAT_MEMORY) {
-				target = find_mem_target(targs[targ]);
-				if (target && target->processor_pxm == inits[init])
-					hmat_update_target_access(target, type, value);
-			}
+			if (mem_hier == ACPI_HMAT_MEMORY)
+				hmat_update_target_access(targs[targ],
+						inits[init], type, value);
 		}
 	}
 
@@ -367,6 +383,7 @@ static int __init hmat_parse_proximity_domain(union acpi_subtable_headers *heade
 {
 	struct acpi_hmat_proximity_domain *p = (void *)header;
 	struct memory_target *target = NULL;
+	bool found = false;
 
 	if (p->header.length != sizeof(*p)) {
 		pr_notice("HMAT: Unexpected address range header length: %d\n",
@@ -382,23 +399,34 @@ static int __init hmat_parse_proximity_domain(union acpi_subtable_headers *heade
 		pr_info("HMAT: Memory Flags:%04x Processor Domain:%d Memory Domain:%d\n",
 			p->flags, p->processor_PD, p->memory_PD);
 
-	if (p->flags & ACPI_HMAT_MEMORY_PD_VALID) {
-		target = find_mem_target(p->memory_PD);
-		if (!target) {
-			pr_debug("HMAT: Memory Domain missing from SRAT\n");
-			return -EINVAL;
-		}
-	}
-	if (target && p->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
-		int p_node = pxm_to_node(p->processor_PD);
+	if ((p->flags & ACPI_HMAT_MEMORY_PD_VALID) == 0)
+		return 0;
+
+	list_for_each_entry(target, &targets, node) {
+		int p_node;
+
+		if (target->memory_pxm != p->memory_PD)
+			continue;
+		found = true;
 
+		if ((p->flags & ACPI_HMAT_PROCESSOR_PD_VALID) == 0)
+			continue;
+
+		p_node = pxm_to_node(p->processor_PD);
 		if (p_node == NUMA_NO_NODE) {
-			pr_debug("HMAT: Invalid Processor Domain\n");
+			pr_debug("HMAT: Invalid Processor Domain: %d\n",
+					p->processor_PD);
 			return -EINVAL;
 		}
+
 		target->processor_pxm = p_node;
 	}
 
+	if (!found) {
+		pr_debug("HMAT: Memory Domain missing from SRAT for pxm: %d\n",
+				p->memory_PD);
+		return -EINVAL;
+	}
 	return 0;
 }
 
@@ -431,7 +459,7 @@ static __init int srat_parse_mem_affinity(union acpi_subtable_headers *header,
 		return -EINVAL;
 	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
 		return 0;
-	alloc_memory_target(ma->proximity_domain);
+	alloc_memory_target(ma->proximity_domain, ma->base_address, ma->length);
 	return 0;
 }
 
@@ -568,7 +596,8 @@ static __init void hmat_register_target_initiators(struct memory_target *target)
 				clear_bit(initiator->processor_pxm, p_nodes);
 		}
 		if (best)
-			hmat_update_target_access(target, loc->hmat_loc->data_type, best);
+			__hmat_update_target_access(target,
+					loc->hmat_loc->data_type, best);
 	}
 
 	for_each_set_bit(i, p_nodes, MAX_NUMNODES) {

[RFC PATCH 4/5] acpi/hmat: Register special purpose memory as a device

2019-04-04T19:21:32Z

Memory that has been tagged EFI_SPECIAL_PURPOSE, and has performance
properties described by the ACPI HMAT is expected to have an application
specific consumer.

Those consumers may want 100% of the memory capacity to be reserved from
any usage by the kernel. By default, with this enabling, a platform
device is created to represent this differentiated resource.

A follow on change arranges for device-dax to claim these devices by
default and provide an mmap interface for the target application.
However, if the administrator prefers that some or all of the special
purpose memory is made available to the core-mm the device-dax hotplug
facility can be used to online the memory with its own numa node.

Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Keith Busch 
Cc: Jonathan Cameron 
Signed-off-by: Dan Williams 
---
 drivers/acpi/hmat/Kconfig |    1 +
 drivers/acpi/hmat/hmat.c  |   63 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/memregion.h |    3 ++
 3 files changed, 67 insertions(+)

diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
index 95a29964dbea..4fcf76e8aa1d 100644
--- a/drivers/acpi/hmat/Kconfig
+++ b/drivers/acpi/hmat/Kconfig
@@ -3,6 +3,7 @@ config ACPI_HMAT
 	bool "ACPI Heterogeneous Memory Attribute Table Support"
 	depends on ACPI_NUMA
 	select HMEM_REPORTING
+	select MEMREGION
 	help
 	 If set, this option has the kernel parse and report the
 	 platform's ACPI HMAT (Heterogeneous Memory Attributes Table),
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index e7ae44c8d359..482360004ea0 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -612,6 +615,65 @@ static __init void hmat_register_target_perf(struct memory_target *target)
 	node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0);
 }
 
+static __init void hmat_register_target_device(struct memory_target *target)
+{
+	struct memregion_info info;
+	struct resource res = {
+		.start = target->start,
+		.end = target->start + target->size - 1,
+		.flags = IORESOURCE_MEM,
+		.desc = IORES_DESC_APPLICATION_RESERVED,
+	};
+	struct platform_device *pdev;
+	int rc, id;
+
+	if (region_intersects(target->start, target->size, IORESOURCE_MEM,
+				IORES_DESC_APPLICATION_RESERVED)
+			!= REGION_INTERSECTS)
+		return;
+
+	id = memregion_alloc();
+	if (id < 0) {
+		pr_err("acpi/hmat: memregion allocation failure for %pr\n", &res);
+		return;
+	}
+
+	pdev = platform_device_alloc("hmem", id);
+	if (!pdev) {
+		pr_err("acpi/hmat: hmem device allocation failure for %pr\n", &res);
+		goto out_pdev;
+	}
+
+	pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->processor_pxm);
+	info = (struct memregion_info) {
+		.target_node = acpi_map_pxm_to_node(target->memory_pxm),
+	};
+	rc = platform_device_add_data(pdev, &info, sizeof(info));
+	if (rc < 0) {
+		pr_err("acpi/hmat: hmem memregion_info allocation failure for %pr\n", &res);
+		goto out_pdev;
+	}
+
+	rc = platform_device_add_resources(pdev, &res, 1);
+	if (rc < 0) {
+		pr_err("acpi/hmat: hmem resource allocation failure for %pr\n", &res);
+		goto out_resource;
+	}
+
+	rc = platform_device_add(pdev);
+	if (rc < 0) {
+		dev_err(&pdev->dev, "acpi/hmat: device add failed for %pr\n", &res);
+		goto out_resource;
+	}
+
+	return;
+
+out_resource:
+	put_device(&pdev->dev);
+out_pdev:
+	memregion_free(id);
+}
+
 static __init void hmat_register_targets(void)
 {
 	struct memory_target *target;
@@ -619,6 +681,7 @@ static __init void hmat_register_targets(void)
 	list_for_each_entry(target, &targets, node) {
 		hmat_register_target_initiators(target);
 		hmat_register_target_perf(target);
+		hmat_register_target_device(target);
 	}
 }
 
diff --git a/include/linux/memregion.h b/include/linux/memregion.h
index 99fa47793b49..5de2ac7fcf5e 100644
--- a/include/linux/memregion.h
+++ b/include/linux/memregion.h
@@ -1,6 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 #ifndef _MEMREGION_H_
 #define _MEMREGION_H_
+struct memregion_info {
+	int target_node;
+};
 int memregion_alloc(void);
 void memregion_free(int id);
 #endif /* _MEMREGION_H_ */

[RFC PATCH 5/5] device-dax: Add a driver for "hmem" devices

2019-04-04T19:21:37Z

Platform firmware like EFI/ACPI may publish "hmem" platform devices.
Such a device is a performance differentiated memory range likely
reserved for an application specific use case. The driver gives access
to 100% of the capacity via a device-dax mmap instance by default.

However, if over-subscription and other kernel memory management is
desired the resulting dax device can be assigned to the core-mm via the
kmem driver.

Cc: Vishal Verma 
Cc: Keith Busch 
Cc: Dave Jiang 
Signed-off-by: Dan Williams 
---
 drivers/dax/Kconfig  |   26 ++++++++++++++++++----
 drivers/dax/Makefile |    2 ++
 drivers/dax/hmem.c   |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 81 insertions(+), 5 deletions(-)
 create mode 100644 drivers/dax/hmem.c

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 5ef624fe3934..b3886f43dd77 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -38,16 +38,32 @@ config DEV_DAX_KMEM
 	depends on DEV_DAX
 	depends on MEMORY_HOTPLUG # for add_memory() and friends
 	help
-	  Support access to persistent memory as if it were RAM.  This
-	  allows easier use of persistent memory by unmodified
-	  applications.
+	  Support access to persistent, or other performance
+	  differentiated memory as if it were System RAM. This allows
+	  easier use of persistent memory by unmodified applications, or
+	  adds core kernel memory services to heterogeneous memory types
+	  (HMEM) marked "reserved" by platform firmware.
 
 	  To use this feature, a DAX device must be unbound from the
-	  device_dax driver (PMEM DAX) and bound to this kmem driver
-	  on each boot.
+	  device_dax driver and bound to this kmem driver on each boot.
 
 	  Say N if unsure.
 
+config DEV_DAX_HMEM
+	tristate "HMEM DAX: generic support for 'special purpose' memory"
+	default DEV_DAX
+	help
+	  EFI 2.8 platforms, and others, may advertise 'special purpose'
+	  memory.  For example, a high bandwidth memory pool. The
+	  indication from platform firmware is meant to reserve the
+	  memory from typical usage by default.  This driver creates
+	  device-dax instances for these memory ranges, and that also
+	  enables the possibility to assign them to the DEV_DAX_KMEM
+	  driver to override the reservation and add them to kernel
+	  "System RAM" pool.
+
+	  Say Y if unsure.
+
 config DEV_DAX_PMEM_COMPAT
 	tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
 	depends on DEV_DAX_PMEM
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 81f7d54dadfb..80065b38b3c4 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -2,9 +2,11 @@
 obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
+obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
 
 dax-y := super.o
 dax-y += bus.o
 device_dax-y := device.o
+dax_hmem-y := hmem.o
 
 obj-y += pmem/
diff --git a/drivers/dax/hmem.c b/drivers/dax/hmem.c
new file mode 100644
index 000000000000..b01cf9c65329
--- /dev/null
+++ b/drivers/dax/hmem.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bus.h"
+
+static int dax_hmem_probe(struct platform_device *pdev)
+{
+	struct dev_pagemap pgmap = { 0 };
+	struct device *dev = &pdev->dev;
+	struct dax_region *dax_region;
+	struct memregion_info *mri;
+	struct dev_dax *dev_dax;
+	struct resource *res;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res)
+		return -ENOMEM;
+
+	mri = dev->platform_data;
+	pgmap.dev = dev;
+	memcpy(&pgmap.res, res, sizeof(*res));
+
+	dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
+			PMD_SIZE, PFN_DEV|PFN_MAP);
+	if (!dax_region)
+		return -ENOMEM;
+
+	dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap);
+	if (IS_ERR(dev_dax))
+		return PTR_ERR(dev_dax);
+
+	/* child dev_dax instances now own the lifetime of the dax_region */
+	dax_region_put(dax_region);
+	return 0;
+}
+
+static int dax_hmem_remove(struct platform_device *pdev)
+{
+	/* devm handles teardown */
+	return 0;
+}
+
+static struct platform_driver dax_hmem_driver = {
+	.probe = dax_hmem_probe,
+	.remove = dax_hmem_remove,
+	.driver = {
+		.name = "hmem",
+	},
+};
+
+module_platform_driver(dax_hmem_driver);
+
+MODULE_ALIAS("platform:hmem*");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");

Re: [RFC PATCH 2/5] lib/memregion: Uplevel the pmem "region" ida to a global allocator

2019-04-04T19:32:14Z

On Thu, Apr 04, 2019 at 12:08:38PM -0700, Dan Williams wrote:
> +++ b/lib/Kconfig
> @@ -318,6 +318,12 @@ config DECOMPRESS_LZ4
>  config GENERIC_ALLOCATOR
>  	bool
>  
> +#
> +# Generic IDA for memory regions
> +#

Leaky abstraction -- nobody needs know that it's implemented as an IDA.
Suggest:

# Memory region ID allocation

...

> +++ b/lib/memregion.c
> @@ -0,0 +1,22 @@
> +#include 
> +#include 
> +
> +static DEFINE_IDA(region_ida);
> +
> +int memregion_alloc(void)
> +{
> +	return ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
> +}
> +EXPORT_SYMBOL(memregion_alloc);
> +
> +void memregion_free(int id)
> +{
> +	ida_simple_remove(&region_ida, id);
> +}
> +EXPORT_SYMBOL(memregion_free);
> +
> +static void __exit memregion_exit(void)
> +{
> +	ida_destroy(&region_ida);
> +}
> +module_exit(memregion_exit);

 - Should these be EXPORT_SYMBOL_GPL?
 - Can we use the new interface, ida_alloc() and ida_free()?
 - Do we really want memregion_exit() to happen while there are still IDs
   allocated in the IDA?  I think this might well be better as:

	BUG_ON(!ida_empty(&region_ida));

Also, do we really want to call the structure the region_ida?  Why not
region_ids?

Re: [RFC PATCH 3/5] acpi/hmat: Track target address ranges

2019-04-04T20:58:25Z

On Thu, Apr 4, 2019 at 1:56 PM Keith Busch  wrote:
>
> On Thu, Apr 04, 2019 at 12:08:44PM -0700, Dan Williams wrote:
> > As of ACPI 6.3 the HMAT no longer advertises the physical memory address
> > range for its entries. Instead, the expectation is the corresponding
> > entry in the SRAT is looked up by the target proximity domain.
> >
> > Given there may be multiple distinct address ranges that share the same
> > performance profile (sparse address space), find_mem_target() is updated
> > to also consider the start address of the memory range. Target property
> > updates are also adjusted to loop over all possible 'struct target'
> > instances that may share the same proximity domain identification.
>
> Since this may allocate multiple targets with the same PXM,
> hmat_register_targets() will attempt to register the same node multiple
> times.
>
> Would it make sense if the existing struct memory_target adds a resource
> list that we can append to as we parse SRAT? That way we have one target
> per memory node, and also track the ranges.

That sounds reasonable to me.

Re: [RFC PATCH 3/5] acpi/hmat: Track target address ranges

2019-04-04T20:56:52Z

On Thu, Apr 04, 2019 at 12:08:44PM -0700, Dan Williams wrote:
> As of ACPI 6.3 the HMAT no longer advertises the physical memory address
> range for its entries. Instead, the expectation is the corresponding
> entry in the SRAT is looked up by the target proximity domain.
> 
> Given there may be multiple distinct address ranges that share the same
> performance profile (sparse address space), find_mem_target() is updated
> to also consider the start address of the memory range. Target property
> updates are also adjusted to loop over all possible 'struct target'
> instances that may share the same proximity domain identification.

Since this may allocate multiple targets with the same PXM,
hmat_register_targets() will attempt to register the same node multiple
times.

Would it make sense if the existing struct memory_target adds a resource
list that we can append to as we parse SRAT? That way we have one target
per memory node, and also track the ranges.

Re: [RFC PATCH 2/5] lib/memregion: Uplevel the pmem "region" ida to a global allocator

2019-04-04T21:02:44Z

On Thu, Apr 4, 2019 at 12:32 PM Matthew Wilcox  wrote:
>
> On Thu, Apr 04, 2019 at 12:08:38PM -0700, Dan Williams wrote:
> > +++ b/lib/Kconfig
> > @@ -318,6 +318,12 @@ config DECOMPRESS_LZ4
> >  config GENERIC_ALLOCATOR
> >       bool
> >
> > +#
> > +# Generic IDA for memory regions
> > +#
>
> Leaky abstraction -- nobody needs know that it's implemented as an IDA.
> Suggest:
>
> # Memory region ID allocation
>

Looks good to me.

> ...
>
> > +++ b/lib/memregion.c
> > @@ -0,0 +1,22 @@
> > +#include 
> > +#include 
> > +
> > +static DEFINE_IDA(region_ida);
> > +
> > +int memregion_alloc(void)
> > +{
> > +     return ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
> > +}
> > +EXPORT_SYMBOL(memregion_alloc);
> > +
> > +void memregion_free(int id)
> > +{
> > +     ida_simple_remove(&region_ida, id);
> > +}
> > +EXPORT_SYMBOL(memregion_free);
> > +
> > +static void __exit memregion_exit(void)
> > +{
> > +     ida_destroy(&region_ida);
> > +}
> > +module_exit(memregion_exit);
>
>  - Should these be EXPORT_SYMBOL_GPL?

I don't see the need. These are simple wrappers around existing
EXPORT_SYMBOL() exports, and there's little concern that these
interfaces might disappear in the future causing us pain with out of
tree modules as these don't touch anything in the core.

>  - Can we use the new interface, ida_alloc() and ida_free()?

Sure.

>  - Do we really want memregion_exit() to happen while there are still IDs
>    allocated in the IDA?  I think this might well be better as:
>
>         BUG_ON(!ida_empty(&region_ida));

True, or just delete the module_exit because this functionality can't
be built as a module, so the exit path is already dead code.

> Also, do we really want to call the structure the region_ida?  Why not
> region_ids?

Sure, sounds good.

Re: [RFC PATCH 4/5] acpi/hmat: Register special purpose memory as a device

2019-04-05T11:19:15Z

On Thu, 4 Apr 2019 12:08:49 -0700
Dan Williams  wrote:

> Memory that has been tagged EFI_SPECIAL_PURPOSE, and has performance
> properties described by the ACPI HMAT is expected to have an application
> specific consumer.
> 
> Those consumers may want 100% of the memory capacity to be reserved from
> any usage by the kernel. By default, with this enabling, a platform
> device is created to represent this differentiated resource.
> 
> A follow on change arranges for device-dax to claim these devices by
> default and provide an mmap interface for the target application.
> However, if the administrator prefers that some or all of the special
> purpose memory is made available to the core-mm the device-dax hotplug
> facility can be used to online the memory with its own numa node.
> 
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Keith Busch 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 

Hi Dan,

Great to see you getting this discussion going so fast and in
general the approach makes sense to me.

I'm a little confused why HMAT has anything to do with this.
SPM is defined either via the attribute in SRAT SPA entries,
EF_MEMORY_SP or via the EFI memory map.

Whether it is in HMAT or not isn't all that relevant.
Back in the days of the reservation hint (so before yesterday :)
it was relevant obviously but that's no longer true.

So what am I missing?

Thanks,

Jonathan


> ---
>  drivers/acpi/hmat/Kconfig |    1 +
>  drivers/acpi/hmat/hmat.c  |   63 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/memregion.h |    3 ++
>  3 files changed, 67 insertions(+)
> 
> diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
> index 95a29964dbea..4fcf76e8aa1d 100644
> --- a/drivers/acpi/hmat/Kconfig
> +++ b/drivers/acpi/hmat/Kconfig
> @@ -3,6 +3,7 @@ config ACPI_HMAT
>  	bool "ACPI Heterogeneous Memory Attribute Table Support"
>  	depends on ACPI_NUMA
>  	select HMEM_REPORTING
> +	select MEMREGION
>  	help
>  	 If set, this option has the kernel parse and report the
>  	 platform's ACPI HMAT (Heterogeneous Memory Attributes Table),
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> index e7ae44c8d359..482360004ea0 100644
> --- a/drivers/acpi/hmat/hmat.c
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -13,6 +13,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -612,6 +615,65 @@ static __init void hmat_register_target_perf(struct memory_target *target)
>  	node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0);
>  }
>  
> +static __init void hmat_register_target_device(struct memory_target *target)
> +{
> +	struct memregion_info info;
> +	struct resource res = {
> +		.start = target->start,
> +		.end = target->start + target->size - 1,
> +		.flags = IORESOURCE_MEM,
> +		.desc = IORES_DESC_APPLICATION_RESERVED,
> +	};
> +	struct platform_device *pdev;
> +	int rc, id;
> +
> +	if (region_intersects(target->start, target->size, IORESOURCE_MEM,
> +				IORES_DESC_APPLICATION_RESERVED)
> +			!= REGION_INTERSECTS)
> +		return;
> +
> +	id = memregion_alloc();
> +	if (id < 0) {
> +		pr_err("acpi/hmat: memregion allocation failure for %pr\n", &res);
> +		return;
> +	}
> +
> +	pdev = platform_device_alloc("hmem", id);
> +	if (!pdev) {
> +		pr_err("acpi/hmat: hmem device allocation failure for %pr\n", &res);
> +		goto out_pdev;
> +	}
> +
> +	pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->processor_pxm);
> +	info = (struct memregion_info) {
> +		.target_node = acpi_map_pxm_to_node(target->memory_pxm),
> +	};
> +	rc = platform_device_add_data(pdev, &info, sizeof(info));
> +	if (rc < 0) {
> +		pr_err("acpi/hmat: hmem memregion_info allocation failure for %pr\n", &res);
> +		goto out_pdev;
> +	}
> +
> +	rc = platform_device_add_resources(pdev, &res, 1);
> +	if (rc < 0) {
> +		pr_err("acpi/hmat: hmem resource allocation failure for %pr\n", &res);
> +		goto out_resource;
> +	}
> +
> +	rc = platform_device_add(pdev);
> +	if (rc < 0) {
> +		dev_err(&pdev->dev, "acpi/hmat: device add failed for %pr\n", &res);
> +		goto out_resource;
> +	}
> +
> +	return;
> +
> +out_resource:
> +	put_device(&pdev->dev);
> +out_pdev:
> +	memregion_free(id);
> +}
> +
>  static __init void hmat_register_targets(void)
>  {
>  	struct memory_target *target;
> @@ -619,6 +681,7 @@ static __init void hmat_register_targets(void)
>  	list_for_each_entry(target, &targets, node) {
>  		hmat_register_target_initiators(target);
>  		hmat_register_target_perf(target);
> +		hmat_register_target_device(target);
>  	}
>  }
>  
> diff --git a/include/linux/memregion.h b/include/linux/memregion.h
> index 99fa47793b49..5de2ac7fcf5e 100644
> --- a/include/linux/memregion.h
> +++ b/include/linux/memregion.h
> @@ -1,6 +1,9 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #ifndef _MEMREGION_H_
>  #define _MEMREGION_H_
> +struct memregion_info {
> +	int target_node;
> +};
>  int memregion_alloc(void);
>  void memregion_free(int id);
>  #endif /* _MEMREGION_H_ */
>

Re: [RFC PATCH 4/5] acpi/hmat: Register special purpose memory as a device

2019-04-05T15:43:17Z

On Fri, Apr 5, 2019 at 4:19 AM Jonathan Cameron
 wrote:
>
> On Thu, 4 Apr 2019 12:08:49 -0700
> Dan Williams  wrote:
>
> > Memory that has been tagged EFI_SPECIAL_PURPOSE, and has performance
> > properties described by the ACPI HMAT is expected to have an application
> > specific consumer.
> >
> > Those consumers may want 100% of the memory capacity to be reserved from
> > any usage by the kernel. By default, with this enabling, a platform
> > device is created to represent this differentiated resource.
> >
> > A follow on change arranges for device-dax to claim these devices by
> > default and provide an mmap interface for the target application.
> > However, if the administrator prefers that some or all of the special
> > purpose memory is made available to the core-mm the device-dax hotplug
> > facility can be used to online the memory with its own numa node.
> >
> > Cc: "Rafael J. Wysocki" 
> > Cc: Len Brown 
> > Cc: Keith Busch 
> > Cc: Jonathan Cameron 
> > Signed-off-by: Dan Williams 
>
> Hi Dan,
>
> Great to see you getting this discussion going so fast and in
> general the approach makes sense to me.
>
> I'm a little confused why HMAT has anything to do with this.
> SPM is defined either via the attribute in SRAT SPA entries,
> EF_MEMORY_SP or via the EFI memory map.
>
> Whether it is in HMAT or not isn't all that relevant.
> Back in the days of the reservation hint (so before yesterday :)
> it was relevant obviously but that's no longer true.
>
> So what am I missing?

It's a good question, and an assumption I should have explicitly
declared in the changelog. The problem with EFI_MEMORY_SP is the same
as the problem with the EfiPersistentMemory type, it isn't precise
enough on its own for the kernel to delineate 'type' or
device/replaceable-unit boundaries. For example, I expect one
EFI_MEMORY_SP range of a specific type may be contiguous with another
range of a different type. Similar to the NFIT there is no requirement
in the specification that platform firmware inject multiple range
entries. Instead that precision is left to the SRAT + HMAT, or the
NFIT in the case of PMEM.

Conversely, and thinking through this a bit more, if a memory range is
"special", but the platform fails to enumerate it in HMAT I think
Linux should scream loudly that the firmware is broken and leave the
range alone. The "scream loudly" piece is missing in the current set,
but the "leave the range alone" functionality is included.

Re: [RFC PATCH 4/5] acpi/hmat: Register special purpose memory as a device

2019-04-05T16:24:00Z

On Fri, 5 Apr 2019 08:43:03 -0700
Dan Williams  wrote:

> On Fri, Apr 5, 2019 at 4:19 AM Jonathan Cameron
>  wrote:
> >
> > On Thu, 4 Apr 2019 12:08:49 -0700
> > Dan Williams  wrote:
> >  
> > > Memory that has been tagged EFI_SPECIAL_PURPOSE, and has performance
> > > properties described by the ACPI HMAT is expected to have an application
> > > specific consumer.
> > >
> > > Those consumers may want 100% of the memory capacity to be reserved from
> > > any usage by the kernel. By default, with this enabling, a platform
> > > device is created to represent this differentiated resource.
> > >
> > > A follow on change arranges for device-dax to claim these devices by
> > > default and provide an mmap interface for the target application.
> > > However, if the administrator prefers that some or all of the special
> > > purpose memory is made available to the core-mm the device-dax hotplug
> > > facility can be used to online the memory with its own numa node.
> > >
> > > Cc: "Rafael J. Wysocki" 
> > > Cc: Len Brown 
> > > Cc: Keith Busch 
> > > Cc: Jonathan Cameron 
> > > Signed-off-by: Dan Williams   
> >
> > Hi Dan,
> >
> > Great to see you getting this discussion going so fast and in
> > general the approach makes sense to me.
> >
> > I'm a little confused why HMAT has anything to do with this.
> > SPM is defined either via the attribute in SRAT SPA entries,
> > EF_MEMORY_SP or via the EFI memory map.
> >
> > Whether it is in HMAT or not isn't all that relevant.
> > Back in the days of the reservation hint (so before yesterday :)
> > it was relevant obviously but that's no longer true.
> >
> > So what am I missing?  
> 
> It's a good question, and an assumption I should have explicitly
> declared in the changelog. The problem with EFI_MEMORY_SP is the same
> as the problem with the EfiPersistentMemory type, it isn't precise
> enough on its own for the kernel to delineate 'type' or
> device/replaceable-unit boundaries. For example, I expect one
> EFI_MEMORY_SP range of a specific type may be contiguous with another
> range of a different type. Similar to the NFIT there is no requirement
> in the specification that platform firmware inject multiple range
> entries. Instead that precision is left to the SRAT + HMAT, or the
> NFIT in the case of PMEM.

Absolutely, as long as they are all SPM, they could be anywhere in
the system.

> 
> Conversely, and thinking through this a bit more, if a memory range is
> "special", but the platform fails to enumerate it in HMAT I think
> Linux should scream loudly that the firmware is broken and leave the
> range alone. The "scream loudly" piece is missing in the current set,
> but the "leave the range alone" functionality is included.

I am certainly keen on screaming if the various entries are inconsistent
but am not sure they necessarily are here.

So there are a couple of ways we could get an SPM range defined.
The key thing here is that firmware should be attempting to describe
what it has to some degree somewhere.  If not it won't get a good
result ;)  So if there is no SRAT then you are on your own. SCREAM!

1. Directly in the memory map.  If there is no other information then
   tough luck the kernel can only sensibly handle it as one device.
   Or not at all, which seems like a reasonable decision to me.
   SCREAM

2. In memory map + a proximity domain entry in SRAT.  Given memory
   with different characteristics should be in different proximity
   domains anyway - this should be fairly precise. The slight snag
   here is that the fine grained nature of SRAT is actually a side
   effect of HMAT, so not sure well platforms have traditional
   describe their more subtle differences.

3. In NFIT as NFIT SPA carries the memory attribute.  Not sure if
   we should scream if this disagrees with the memory map.

4. In HMAT?  Now this changed in ACPI 6.3 to clean up the 'messy'
   prior relationship between it and SRAT.  Now HMAT no longer has
   memory address ranges as you observed.  That means, to describe
   properties of memory, it has to use the proximity domains of
   SRAT.  It provides lots of additional info about those domains
   but it is SRAT that defines them.

So I would argue that HMAT itself doesn't tell us anything useful.
SRAT certainly does though so I think this should be coming from
SRAT (or NFIT as that also defines the required precision)

Jonathan

Re: [RFC PATCH 4/5] acpi/hmat: Register special purpose memory as a device

2019-04-05T16:56:36Z

On Fri, Apr 5, 2019 at 9:24 AM Jonathan Cameron
 wrote:
>
> On Fri, 5 Apr 2019 08:43:03 -0700
> Dan Williams  wrote:
>
> > On Fri, Apr 5, 2019 at 4:19 AM Jonathan Cameron
> >  wrote:
> > >
> > > On Thu, 4 Apr 2019 12:08:49 -0700
> > > Dan Williams  wrote:
> > >
> > > > Memory that has been tagged EFI_SPECIAL_PURPOSE, and has performance
> > > > properties described by the ACPI HMAT is expected to have an application
> > > > specific consumer.
> > > >
> > > > Those consumers may want 100% of the memory capacity to be reserved from
> > > > any usage by the kernel. By default, with this enabling, a platform
> > > > device is created to represent this differentiated resource.
> > > >
> > > > A follow on change arranges for device-dax to claim these devices by
> > > > default and provide an mmap interface for the target application.
> > > > However, if the administrator prefers that some or all of the special
> > > > purpose memory is made available to the core-mm the device-dax hotplug
> > > > facility can be used to online the memory with its own numa node.
> > > >
> > > > Cc: "Rafael J. Wysocki" 
> > > > Cc: Len Brown 
> > > > Cc: Keith Busch 
> > > > Cc: Jonathan Cameron 
> > > > Signed-off-by: Dan Williams 
> > >
> > > Hi Dan,
> > >
> > > Great to see you getting this discussion going so fast and in
> > > general the approach makes sense to me.
> > >
> > > I'm a little confused why HMAT has anything to do with this.
> > > SPM is defined either via the attribute in SRAT SPA entries,
> > > EF_MEMORY_SP or via the EFI memory map.
> > >
> > > Whether it is in HMAT or not isn't all that relevant.
> > > Back in the days of the reservation hint (so before yesterday :)
> > > it was relevant obviously but that's no longer true.
> > >
> > > So what am I missing?
> >
> > It's a good question, and an assumption I should have explicitly
> > declared in the changelog. The problem with EFI_MEMORY_SP is the same
> > as the problem with the EfiPersistentMemory type, it isn't precise
> > enough on its own for the kernel to delineate 'type' or
> > device/replaceable-unit boundaries. For example, I expect one
> > EFI_MEMORY_SP range of a specific type may be contiguous with another
> > range of a different type. Similar to the NFIT there is no requirement
> > in the specification that platform firmware inject multiple range
> > entries. Instead that precision is left to the SRAT + HMAT, or the
> > NFIT in the case of PMEM.
>
> Absolutely, as long as they are all SPM, they could be anywhere in
> the system.
>
> >
> > Conversely, and thinking through this a bit more, if a memory range is
> > "special", but the platform fails to enumerate it in HMAT I think
> > Linux should scream loudly that the firmware is broken and leave the
> > range alone. The "scream loudly" piece is missing in the current set,
> > but the "leave the range alone" functionality is included.
>
> I am certainly keen on screaming if the various entries are inconsistent
> but am not sure they necessarily are here.
>
> So there are a couple of ways we could get an SPM range defined.
> The key thing here is that firmware should be attempting to describe
> what it has to some degree somewhere.  If not it won't get a good
> result ;)  So if there is no SRAT then you are on your own. SCREAM!
>
> 1. Directly in the memory map.  If there is no other information then
>    tough luck the kernel can only sensibly handle it as one device.
>    Or not at all, which seems like a reasonable decision to me.
>    SCREAM
>
> 2. In memory map + a proximity domain entry in SRAT.  Given memory
>    with different characteristics should be in different proximity
>    domains anyway - this should be fairly precise. The slight snag
>    here is that the fine grained nature of SRAT is actually a side
>    effect of HMAT, so not sure well platforms have traditional
>    describe their more subtle differences.
>
> 3. In NFIT as NFIT SPA carries the memory attribute.  Not sure if
>    we should scream if this disagrees with the memory map.
>
> 4. In HMAT?  Now this changed in ACPI 6.3 to clean up the 'messy'
>    prior relationship between it and SRAT.  Now HMAT no longer has
>    memory address ranges as you observed.  That means, to describe
>    properties of memory, it has to use the proximity domains of
>    SRAT.  It provides lots of additional info about those domains
>    but it is SRAT that defines them.
>
> So I would argue that HMAT itself doesn't tell us anything useful.
> SRAT certainly does though so I think this should be coming from
> SRAT (or NFIT as that also defines the required precision)

I agree, yes, SRAT by itself is sufficient for this "precision"
concern. However, do we, core Linux developers, really want to
encourage platform vendors that they can ignore deploying HMAT data
and get Linux to honor that sub-case for EFI_MEMORY_SP? My personal
experience is that platform firmware will take advantage of almost any
opportunity to minimize the data it provides to the OS. The only hard
lever Linux has to encourage platform firmware to give complete data
is to decline to support configurations that have incomplete data.

Re: [RFC PATCH 4/5] acpi/hmat: Register special purpose memory as a device

2019-04-05T17:40:15Z

On Fri, 5 Apr 2019 09:56:22 -0700
Dan Williams  wrote:

> On Fri, Apr 5, 2019 at 9:24 AM Jonathan Cameron
>  wrote:
> >
> > On Fri, 5 Apr 2019 08:43:03 -0700
> > Dan Williams  wrote:
> >  
> > > On Fri, Apr 5, 2019 at 4:19 AM Jonathan Cameron
> > >  wrote:  
> > > >
> > > > On Thu, 4 Apr 2019 12:08:49 -0700
> > > > Dan Williams  wrote:
> > > >  
> > > > > Memory that has been tagged EFI_SPECIAL_PURPOSE, and has performance
> > > > > properties described by the ACPI HMAT is expected to have an application
> > > > > specific consumer.
> > > > >
> > > > > Those consumers may want 100% of the memory capacity to be reserved from
> > > > > any usage by the kernel. By default, with this enabling, a platform
> > > > > device is created to represent this differentiated resource.
> > > > >
> > > > > A follow on change arranges for device-dax to claim these devices by
> > > > > default and provide an mmap interface for the target application.
> > > > > However, if the administrator prefers that some or all of the special
> > > > > purpose memory is made available to the core-mm the device-dax hotplug
> > > > > facility can be used to online the memory with its own numa node.
> > > > >
> > > > > Cc: "Rafael J. Wysocki" 
> > > > > Cc: Len Brown 
> > > > > Cc: Keith Busch 
> > > > > Cc: Jonathan Cameron 
> > > > > Signed-off-by: Dan Williams   
> > > >
> > > > Hi Dan,
> > > >
> > > > Great to see you getting this discussion going so fast and in
> > > > general the approach makes sense to me.
> > > >
> > > > I'm a little confused why HMAT has anything to do with this.
> > > > SPM is defined either via the attribute in SRAT SPA entries,
> > > > EF_MEMORY_SP or via the EFI memory map.
> > > >
> > > > Whether it is in HMAT or not isn't all that relevant.
> > > > Back in the days of the reservation hint (so before yesterday :)
> > > > it was relevant obviously but that's no longer true.
> > > >
> > > > So what am I missing?  
> > >
> > > It's a good question, and an assumption I should have explicitly
> > > declared in the changelog. The problem with EFI_MEMORY_SP is the same
> > > as the problem with the EfiPersistentMemory type, it isn't precise
> > > enough on its own for the kernel to delineate 'type' or
> > > device/replaceable-unit boundaries. For example, I expect one
> > > EFI_MEMORY_SP range of a specific type may be contiguous with another
> > > range of a different type. Similar to the NFIT there is no requirement
> > > in the specification that platform firmware inject multiple range
> > > entries. Instead that precision is left to the SRAT + HMAT, or the
> > > NFIT in the case of PMEM.  
> >
> > Absolutely, as long as they are all SPM, they could be anywhere in
> > the system.
> >  
> > >
> > > Conversely, and thinking through this a bit more, if a memory range is
> > > "special", but the platform fails to enumerate it in HMAT I think
> > > Linux should scream loudly that the firmware is broken and leave the
> > > range alone. The "scream loudly" piece is missing in the current set,
> > > but the "leave the range alone" functionality is included.  
> >
> > I am certainly keen on screaming if the various entries are inconsistent
> > but am not sure they necessarily are here.
> >
> > So there are a couple of ways we could get an SPM range defined.
> > The key thing here is that firmware should be attempting to describe
> > what it has to some degree somewhere.  If not it won't get a good
> > result ;)  So if there is no SRAT then you are on your own. SCREAM!
> >
> > 1. Directly in the memory map.  If there is no other information then
> >    tough luck the kernel can only sensibly handle it as one device.
> >    Or not at all, which seems like a reasonable decision to me.
> >    SCREAM
> >
> > 2. In memory map + a proximity domain entry in SRAT.  Given memory
> >    with different characteristics should be in different proximity
> >    domains anyway - this should be fairly precise. The slight snag
> >    here is that the fine grained nature of SRAT is actually a side
> >    effect of HMAT, so not sure well platforms have traditional
> >    describe their more subtle differences.
> >
> > 3. In NFIT as NFIT SPA carries the memory attribute.  Not sure if
> >    we should scream if this disagrees with the memory map.
> >
> > 4. In HMAT?  Now this changed in ACPI 6.3 to clean up the 'messy'
> >    prior relationship between it and SRAT.  Now HMAT no longer has
> >    memory address ranges as you observed.  That means, to describe
> >    properties of memory, it has to use the proximity domains of
> >    SRAT.  It provides lots of additional info about those domains
> >    but it is SRAT that defines them.
> >
> > So I would argue that HMAT itself doesn't tell us anything useful.
> > SRAT certainly does though so I think this should be coming from
> > SRAT (or NFIT as that also defines the required precision)  
> 
> I agree, yes, SRAT by itself is sufficient for this "precision"
> concern. However, do we, core Linux developers, really want to
> encourage platform vendors that they can ignore deploying HMAT data
> and get Linux to honor that sub-case for EFI_MEMORY_SP? My personal
> experience is that platform firmware will take advantage of almost any
> opportunity to minimize the data it provides to the OS. The only hard
> lever Linux has to encourage platform firmware to give complete data
> is to decline to support configurations that have incomplete data.
> 

If we decide as a community that this is the way we want to go, I'm
happy to politely point it out to our firmware people (who are a more
proactive group on detailed system descriptions than many!)

If we make this a clearly stated policy, perhaps via some comments
in the code or Documentation/ that that would be even better
and avoid people taking the 'but you could support my firmware'
line in the future.

I'll see if I can reach out to other OS vendors as well so we
can present a unified front on this (perhaps after a few days, just
in case we have any dissenting voices here!)

Thanks,

Jonathan

Re: [RFC PATCH 1/5] efi: Detect UEFI 2.8 Special Purpose Memory

2019-04-06T04:21:14Z

Hi Dan,

On Thu, 4 Apr 2019 at 21:21, Dan Williams  wrote:
>
> UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
> interpretation of the EFI Memory Types as "reserved for a special
> purpose".
>
> The proposed Linux behavior for special purpose memory is that it is
> reserved for direct-access (device-dax) by default and not available for
> any kernel usage, not even as an OOM fallback. Later, through udev
> scripts or another init mechanism, these device-dax claimed ranges can
> be reconfigured and hot-added to the available System-RAM with a unique
> node identifier.
>
> A follow-on patch integrates parsing of the ACPI HMAT to identify the
> node and sub-range boundaries of EFI_MEMORY_SP designated memory. For
> now, arrange for EFI_MEMORY_SP memory to be reserved.
>
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: Ard Biesheuvel 
> Cc: Darren Hart 
> Cc: Andy Shevchenko 
> Signed-off-by: Dan Williams 
> ---
>  arch/x86/Kconfig                  |   18 ++++++++++++++++++
>  arch/x86/boot/compressed/eboot.c  |    5 ++++-
>  arch/x86/boot/compressed/kaslr.c  |    2 +-
>  arch/x86/include/asm/e820/types.h |    9 +++++++++
>  arch/x86/kernel/e820.c            |    9 +++++++--
>  arch/x86/platform/efi/efi.c       |   10 +++++++++-
>  include/linux/efi.h               |   14 ++++++++++++++
>  include/linux/ioport.h            |    1 +
>  8 files changed, 63 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c1f9b3cf437c..cb9ca27de7a5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1961,6 +1961,24 @@ config EFI_MIXED
>
>            If unsure, say N.
>
> +config EFI_SPECIAL_MEMORY
> +       bool "EFI Special Purpose Memory Support"
> +       depends on EFI
> +       ---help---
> +         On systems that have mixed performance classes of memory EFI
> +         may indicate special purpose memory with an attribute (See
> +         EFI_MEMORY_SP in UEFI 2.8). A memory range tagged with this
> +         attribute may have unique performance characteristics compared
> +         to the system's general purpose "System RAM" pool. On the
> +         expectation that such memory has application specific usage
> +         answer Y to arrange for the kernel to reserve it for
> +         direct-access (device-dax) by default. The memory range can
> +         later be optionally assigned to the page allocator by system
> +         administrator policy. Say N to have the kernel treat this
> +         memory as general purpose by default.
> +
> +         If unsure, say Y.
> +

EFI_MEMORY_SP is now part of the UEFI spec proper, so it does not make
sense to make any understanding of it Kconfigurable.

Instead, what I would prefer is to implement support for EFI_MEMORY_SP
unconditionally (including the ability to identify it in the debug
dump of the memory map etc), in a way that all architectures can use
it. Then, I think we should never treat it as ordinary memory and make
it the firmware's problem not to use the EFI_MEMORY_SP attribute in
cases where it results in undesired behavior in the OS.

Also, sInce there is a generic component and a x86 component, can you
please split those up?

You only cc'ed me on patch #1 this time, but could you please cc me on
the entire series for v2? Thanks.


>  config SECCOMP
>         def_bool y
>         prompt "Enable seccomp to safely compute untrusted bytecode"
> diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
> index 544ac4fafd11..9b90fae21abe 100644
> --- a/arch/x86/boot/compressed/eboot.c
> +++ b/arch/x86/boot/compressed/eboot.c
> @@ -560,7 +560,10 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
>                 case EFI_BOOT_SERVICES_CODE:
>                 case EFI_BOOT_SERVICES_DATA:
>                 case EFI_CONVENTIONAL_MEMORY:
> -                       e820_type = E820_TYPE_RAM;
> +                       if (is_efi_special(d))
> +                               e820_type = E820_TYPE_SPECIAL;
> +                       else
> +                               e820_type = E820_TYPE_RAM;
>                         break;
>
>                 case EFI_ACPI_MEMORY_NVS:
> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index 2e53c056ba20..897e46eb9714 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -757,7 +757,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
>                  *
>                  * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
>                  */
> -               if (md->type != EFI_CONVENTIONAL_MEMORY)
> +               if (md->type != EFI_CONVENTIONAL_MEMORY || is_efi_special(md))
>                         continue;
>
>                 if (efi_mirror_found &&
> diff --git a/arch/x86/include/asm/e820/types.h b/arch/x86/include/asm/e820/types.h
> index c3aa4b5e49e2..0ab8abae2e8b 100644
> --- a/arch/x86/include/asm/e820/types.h
> +++ b/arch/x86/include/asm/e820/types.h
> @@ -28,6 +28,15 @@ enum e820_type {
>          */
>         E820_TYPE_PRAM          = 12,
>
> +       /*
> +        * Special-purpose / application-specific memory is indicated to
> +        * the system via the EFI_MEMORY_SP attribute. Define an e820
> +        * translation of this memory type for the purpose of
> +        * reserving this range and marking it with the
> +        * IORES_DESC_APPLICATION_RESERVED designation.
> +        */
> +       E820_TYPE_SPECIAL       = 0xefffffff,
> +
>         /*
>          * Reserved RAM used by the kernel itself if
>          * CONFIG_INTEL_TXT=y is enabled, memory of this type
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 2879e234e193..9f50dd0bbb04 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -176,6 +176,7 @@ static void __init e820_print_type(enum e820_type type)
>         switch (type) {
>         case E820_TYPE_RAM:             /* Fall through: */
>         case E820_TYPE_RESERVED_KERN:   pr_cont("usable");                      break;
> +       case E820_TYPE_SPECIAL:         /* Fall through: */
>         case E820_TYPE_RESERVED:        pr_cont("reserved");                    break;
>         case E820_TYPE_ACPI:            pr_cont("ACPI data");                   break;
>         case E820_TYPE_NVS:             pr_cont("ACPI NVS");                    break;
> @@ -1023,6 +1024,7 @@ static const char *__init e820_type_to_string(struct e820_entry *entry)
>         case E820_TYPE_UNUSABLE:        return "Unusable memory";
>         case E820_TYPE_PRAM:            return "Persistent Memory (legacy)";
>         case E820_TYPE_PMEM:            return "Persistent Memory";
> +       case E820_TYPE_SPECIAL:         /* Fall-through: */
>         case E820_TYPE_RESERVED:        return "Reserved";
>         default:                        return "Unknown E820 type";
>         }
> @@ -1038,6 +1040,7 @@ static unsigned long __init e820_type_to_iomem_type(struct e820_entry *entry)
>         case E820_TYPE_UNUSABLE:        /* Fall-through: */
>         case E820_TYPE_PRAM:            /* Fall-through: */
>         case E820_TYPE_PMEM:            /* Fall-through: */
> +       case E820_TYPE_SPECIAL:         /* Fall-through: */
>         case E820_TYPE_RESERVED:        /* Fall-through: */
>         default:                        return IORESOURCE_MEM;
>         }
> @@ -1050,6 +1053,7 @@ static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
>         case E820_TYPE_NVS:             return IORES_DESC_ACPI_NV_STORAGE;
>         case E820_TYPE_PMEM:            return IORES_DESC_PERSISTENT_MEMORY;
>         case E820_TYPE_PRAM:            return IORES_DESC_PERSISTENT_MEMORY_LEGACY;
> +       case E820_TYPE_SPECIAL:         return IORES_DESC_APPLICATION_RESERVED;
>         case E820_TYPE_RESERVED_KERN:   /* Fall-through: */
>         case E820_TYPE_RAM:             /* Fall-through: */
>         case E820_TYPE_UNUSABLE:        /* Fall-through: */
> @@ -1065,13 +1069,14 @@ static bool __init do_mark_busy(enum e820_type type, struct resource *res)
>                 return true;
>
>         /*
> -        * Treat persistent memory like device memory, i.e. reserve it
> -        * for exclusive use of a driver
> +        * Treat persistent memory and other special memory ranges like
> +        * device memory, i.e. reserve it for exclusive use of a driver
>          */
>         switch (type) {
>         case E820_TYPE_RESERVED:
>         case E820_TYPE_PRAM:
>         case E820_TYPE_PMEM:
> +       case E820_TYPE_SPECIAL:
>                 return false;
>         case E820_TYPE_RESERVED_KERN:
>         case E820_TYPE_RAM:
> diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> index e1cb01a22fa8..d227751f331b 100644
> --- a/arch/x86/platform/efi/efi.c
> +++ b/arch/x86/platform/efi/efi.c
> @@ -139,7 +139,9 @@ static void __init do_add_efi_memmap(void)
>                 case EFI_BOOT_SERVICES_CODE:
>                 case EFI_BOOT_SERVICES_DATA:
>                 case EFI_CONVENTIONAL_MEMORY:
> -                       if (md->attribute & EFI_MEMORY_WB)
> +                       if (is_efi_special(md))
> +                               e820_type = E820_TYPE_SPECIAL;
> +                       else if (md->attribute & EFI_MEMORY_WB)
>                                 e820_type = E820_TYPE_RAM;
>                         else
>                                 e820_type = E820_TYPE_RESERVED;
> @@ -753,6 +755,12 @@ static bool should_map_region(efi_memory_desc_t *md)
>         if (IS_ENABLED(CONFIG_X86_32))
>                 return false;
>
> +       /*
> +        * Special purpose memory is not mapped by default.
> +        */
> +       if (is_efi_special(md))
> +               return false;
> +
>         /*
>          * Map all of RAM so that we can access arguments in the 1:1
>          * mapping when making EFI runtime calls.
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 54357a258b35..cecbc2bda1da 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -112,6 +112,7 @@ typedef     struct {
>  #define EFI_MEMORY_MORE_RELIABLE \
>                                 ((u64)0x0000000000010000ULL)    /* higher reliability */
>  #define EFI_MEMORY_RO          ((u64)0x0000000000020000ULL)    /* read-only */
> +#define EFI_MEMORY_SP          ((u64)0x0000000000040000ULL)    /* special purpose */
>  #define EFI_MEMORY_RUNTIME     ((u64)0x8000000000000000ULL)    /* range requires runtime mapping */
>  #define EFI_MEMORY_DESCRIPTOR_VERSION  1
>
> @@ -128,6 +129,19 @@ typedef struct {
>         u64 attribute;
>  } efi_memory_desc_t;
>
> +#ifdef CONFIG_EFI_SPECIAL_MEMORY
> +static inline bool is_efi_special(efi_memory_desc_t *md)
> +{
> +       return md->type == EFI_CONVENTIONAL_MEMORY
> +               && (md->attribute & EFI_MEMORY_SP);
> +}
> +#else
> +static inline bool is_efi_special(efi_memory_desc_t *md)
> +{
> +       return false;
> +}
> +#endif
> +
>  typedef struct {
>         efi_guid_t guid;
>         u32 headersize;
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index da0ebaec25f0..2d79841ee9b9 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -133,6 +133,7 @@ enum {
>         IORES_DESC_PERSISTENT_MEMORY_LEGACY     = 5,
>         IORES_DESC_DEVICE_PRIVATE_MEMORY        = 6,
>         IORES_DESC_DEVICE_PUBLIC_MEMORY         = 7,
> +       IORES_DESC_APPLICATION_RESERVED         = 8,
>  };
>
>  /* helpers to define resources */
>

Re: [RFC PATCH 4/5] acpi/hmat: Register special purpose memory as a device

2019-04-09T12:13:21Z

On Thu, Apr 04, 2019 at 12:08:49PM -0700, Dan Williams wrote:
> Memory that has been tagged EFI_SPECIAL_PURPOSE, and has performance
> properties described by the ACPI HMAT is expected to have an application
> specific consumer.
> 
> Those consumers may want 100% of the memory capacity to be reserved from
> any usage by the kernel. By default, with this enabling, a platform
> device is created to represent this differentiated resource.

This sounds more than weird.  Since when did we let the firmware decide
who can use the memory?

Re: [RFC PATCH 4/5] acpi/hmat: Register special purpose memory as a device

2019-04-09T14:49:32Z

On Tue, Apr 9, 2019 at 5:13 AM Christoph Hellwig  wrote:
>
> On Thu, Apr 04, 2019 at 12:08:49PM -0700, Dan Williams wrote:
> > Memory that has been tagged EFI_SPECIAL_PURPOSE, and has performance
> > properties described by the ACPI HMAT is expected to have an application
> > specific consumer.
> >
> > Those consumers may want 100% of the memory capacity to be reserved from
> > any usage by the kernel. By default, with this enabling, a platform
> > device is created to represent this differentiated resource.
>
> This sounds more than weird.  Since when did we let the firmware decide
> who can use the memory?

There's 2 related motivations for playing along with this "special
purpose" attribute. Before this bit we've seen gross hacks in platform
firmware trying to game OS behavior by lying about numa distances in
the ACPI SLIT. For example "near" high bandwidth memory being set at a
large distance to prevent the kernel from allocating from it by
default as much as possible. Secondly, allow niche applications
guarantees about being able to claim all of a given designated
resource.

The above comes with the option to override this default reservation
and just turn it back over to the page allocator i.e. ignore the
platform firmware hint.

The alternative is arranging for "special purpose" memory to be given
to the page allocator by default with the hope that it can be reserved
/ claimed early so the administrator can prevent unwanted allocations.
It just seemed overall easier to "default reserve with the option to
hot-add" instead of "default online with option / hope of hot-remove
or early allocation".

Re: [RFC PATCH 1/5] efi: Detect UEFI 2.8 Special Purpose Memory

2019-04-09T16:44:10Z

On Fri, Apr 5, 2019 at 9:21 PM Ard Biesheuvel  wrote:
>
> Hi Dan,
>
> On Thu, 4 Apr 2019 at 21:21, Dan Williams  wrote:
> >
> > UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
> > interpretation of the EFI Memory Types as "reserved for a special
> > purpose".
> >
> > The proposed Linux behavior for special purpose memory is that it is
> > reserved for direct-access (device-dax) by default and not available for
> > any kernel usage, not even as an OOM fallback. Later, through udev
> > scripts or another init mechanism, these device-dax claimed ranges can
> > be reconfigured and hot-added to the available System-RAM with a unique
> > node identifier.
> >
> > A follow-on patch integrates parsing of the ACPI HMAT to identify the
> > node and sub-range boundaries of EFI_MEMORY_SP designated memory. For
> > now, arrange for EFI_MEMORY_SP memory to be reserved.
> >
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: "H. Peter Anvin" 
> > Cc: Ard Biesheuvel 
> > Cc: Darren Hart 
> > Cc: Andy Shevchenko 
> > Signed-off-by: Dan Williams 
> > ---
> >  arch/x86/Kconfig                  |   18 ++++++++++++++++++
> >  arch/x86/boot/compressed/eboot.c  |    5 ++++-
> >  arch/x86/boot/compressed/kaslr.c  |    2 +-
> >  arch/x86/include/asm/e820/types.h |    9 +++++++++
> >  arch/x86/kernel/e820.c            |    9 +++++++--
> >  arch/x86/platform/efi/efi.c       |   10 +++++++++-
> >  include/linux/efi.h               |   14 ++++++++++++++
> >  include/linux/ioport.h            |    1 +
> >  8 files changed, 63 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index c1f9b3cf437c..cb9ca27de7a5 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1961,6 +1961,24 @@ config EFI_MIXED
> >
> >            If unsure, say N.
> >
> > +config EFI_SPECIAL_MEMORY
> > +       bool "EFI Special Purpose Memory Support"
> > +       depends on EFI
> > +       ---help---
> > +         On systems that have mixed performance classes of memory EFI
> > +         may indicate special purpose memory with an attribute (See
> > +         EFI_MEMORY_SP in UEFI 2.8). A memory range tagged with this
> > +         attribute may have unique performance characteristics compared
> > +         to the system's general purpose "System RAM" pool. On the
> > +         expectation that such memory has application specific usage
> > +         answer Y to arrange for the kernel to reserve it for
> > +         direct-access (device-dax) by default. The memory range can
> > +         later be optionally assigned to the page allocator by system
> > +         administrator policy. Say N to have the kernel treat this
> > +         memory as general purpose by default.
> > +
> > +         If unsure, say Y.
> > +
>
> EFI_MEMORY_SP is now part of the UEFI spec proper, so it does not make
> sense to make any understanding of it Kconfigurable.

No, I think you're misunderstanding what this Kconfig option is trying
to achieve.

The configuration capability is solely for the default kernel policy.
As can already be seen by Christoph's response [1] the thought that
the firmware gets more leeway to dictate to Linux memory policy may be
objectionable.

[1]: https://lore.kernel.org/lkml/20190409121318.GA16955@infradead.org/

So the Kconfig option is gating whether the kernel simply ignores the
attribute and gives it to the page allocator by default. Anything
fancier, like sub-dividing how much is OS managed vs device-dax
accessed requires the OS to reserve it all from the page-allocator by
default until userspace policy can be applied.

> Instead, what I would prefer is to implement support for EFI_MEMORY_SP
> unconditionally (including the ability to identify it in the debug
> dump of the memory map etc), in a way that all architectures can use
> it. Then, I think we should never treat it as ordinary memory and make
> it the firmware's problem not to use the EFI_MEMORY_SP attribute in
> cases where it results in undesired behavior in the OS.

No, a policy of "never treat it as ordinary memory" confuses the base
intent of the attribute which is an optional hint to get the OS to not
put immovable / non-critical allocations in what could be a precious
resource.

Moreover, the interface for platform firmware to indicate that a
memory range should never be treated as ordinary memory is simply the
existing "reserved" memory type, not this attribute. That's the
mechanism to use when platform firmware knows that a driver is needed
for a given mmio resource.

> Also, sInce there is a generic component and a x86 component, can you
> please split those up?

Sure, can do.

>
> You only cc'ed me on patch #1 this time, but could you please cc me on
> the entire series for v2? Thanks.

Yes, will do, and thanks for taking a look.

Re: [RFC PATCH 1/5] efi: Detect UEFI 2.8 Special Purpose Memory

2019-04-09T17:21:29Z

On Tue, 9 Apr 2019 at 09:44, Dan Williams  wrote:
>
> On Fri, Apr 5, 2019 at 9:21 PM Ard Biesheuvel  wrote:
> >
> > Hi Dan,
> >
> > On Thu, 4 Apr 2019 at 21:21, Dan Williams  wrote:
> > >
> > > UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
> > > interpretation of the EFI Memory Types as "reserved for a special
> > > purpose".
> > >
> > > The proposed Linux behavior for special purpose memory is that it is
> > > reserved for direct-access (device-dax) by default and not available for
> > > any kernel usage, not even as an OOM fallback. Later, through udev
> > > scripts or another init mechanism, these device-dax claimed ranges can
> > > be reconfigured and hot-added to the available System-RAM with a unique
> > > node identifier.
> > >
> > > A follow-on patch integrates parsing of the ACPI HMAT to identify the
> > > node and sub-range boundaries of EFI_MEMORY_SP designated memory. For
> > > now, arrange for EFI_MEMORY_SP memory to be reserved.
> > >
> > > Cc: Thomas Gleixner 
> > > Cc: Ingo Molnar 
> > > Cc: Borislav Petkov 
> > > Cc: "H. Peter Anvin" 
> > > Cc: Ard Biesheuvel 
> > > Cc: Darren Hart 
> > > Cc: Andy Shevchenko 
> > > Signed-off-by: Dan Williams 
> > > ---
> > >  arch/x86/Kconfig                  |   18 ++++++++++++++++++
> > >  arch/x86/boot/compressed/eboot.c  |    5 ++++-
> > >  arch/x86/boot/compressed/kaslr.c  |    2 +-
> > >  arch/x86/include/asm/e820/types.h |    9 +++++++++
> > >  arch/x86/kernel/e820.c            |    9 +++++++--
> > >  arch/x86/platform/efi/efi.c       |   10 +++++++++-
> > >  include/linux/efi.h               |   14 ++++++++++++++
> > >  include/linux/ioport.h            |    1 +
> > >  8 files changed, 63 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index c1f9b3cf437c..cb9ca27de7a5 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -1961,6 +1961,24 @@ config EFI_MIXED
> > >
> > >            If unsure, say N.
> > >
> > > +config EFI_SPECIAL_MEMORY
> > > +       bool "EFI Special Purpose Memory Support"
> > > +       depends on EFI
> > > +       ---help---
> > > +         On systems that have mixed performance classes of memory EFI
> > > +         may indicate special purpose memory with an attribute (See
> > > +         EFI_MEMORY_SP in UEFI 2.8). A memory range tagged with this
> > > +         attribute may have unique performance characteristics compared
> > > +         to the system's general purpose "System RAM" pool. On the
> > > +         expectation that such memory has application specific usage
> > > +         answer Y to arrange for the kernel to reserve it for
> > > +         direct-access (device-dax) by default. The memory range can
> > > +         later be optionally assigned to the page allocator by system
> > > +         administrator policy. Say N to have the kernel treat this
> > > +         memory as general purpose by default.
> > > +
> > > +         If unsure, say Y.
> > > +
> >
> > EFI_MEMORY_SP is now part of the UEFI spec proper, so it does not make
> > sense to make any understanding of it Kconfigurable.
>
> No, I think you're misunderstanding what this Kconfig option is trying
> to achieve.
>
> The configuration capability is solely for the default kernel policy.
> As can already be seen by Christoph's response [1] the thought that
> the firmware gets more leeway to dictate to Linux memory policy may be
> objectionable.
>
> [1]: https://lore.kernel.org/lkml/20190409121318.GA16955@infradead.org/
>
> So the Kconfig option is gating whether the kernel simply ignores the
> attribute and gives it to the page allocator by default. Anything
> fancier, like sub-dividing how much is OS managed vs device-dax
> accessed requires the OS to reserve it all from the page-allocator by
> default until userspace policy can be applied.
>

I don't think this policy should dictate whether we pretend that the
attribute doesn't exist in the first place. We should just wire up the
bit fully, and only apply this policy at the very end.

> > Instead, what I would prefer is to implement support for EFI_MEMORY_SP
> > unconditionally (including the ability to identify it in the debug
> > dump of the memory map etc), in a way that all architectures can use
> > it. Then, I think we should never treat it as ordinary memory and make
> > it the firmware's problem not to use the EFI_MEMORY_SP attribute in
> > cases where it results in undesired behavior in the OS.
>
> No, a policy of "never treat it as ordinary memory" confuses the base
> intent of the attribute which is an optional hint to get the OS to not
> put immovable / non-critical allocations in what could be a precious
> resource.
>

The base intent is to prevent the OS from using memory that is
co-located with an accelerator for any purpose other than what the
accelerator needs it for. Having a Kconfigurable policy that may be
disabled kind of misses the point IMO. I think 'optional hint' doesn't
quite capture the intent.

> Moreover, the interface for platform firmware to indicate that a
> memory range should never be treated as ordinary memory is simply the
> existing "reserved" memory type, not this attribute. That's the
> mechanism to use when platform firmware knows that a driver is needed
> for a given mmio resource.
>

Reserved memory is memory that simply should never touched at all by
the OS, and on ARM, we take care never to map it anywhere. However, it
could be annotated with the EFI_MEMORY_RUNTIME attribute in order for
the OS to provide a virtual mapping for it on behalf of the runtime
services, which is why it needs to be listed in the memory map at all.
This has nothing to do with usable memory that should or not should be
used in a certain way by the OS.

> > Also, sInce there is a generic component and a x86 component, can you
> > please split those up?
>
> Sure, can do.
>
> >
> > You only cc'ed me on patch #1 this time, but could you please cc me on
> > the entire series for v2? Thanks.
>
> Yes, will do, and thanks for taking a look.

Re: [RFC PATCH 1/5] efi: Detect UEFI 2.8 Special Purpose Memory

2019-04-10T02:11:08Z

On Tue, Apr 9, 2019 at 10:21 AM Ard Biesheuvel
 wrote:
>
> On Tue, 9 Apr 2019 at 09:44, Dan Williams  wrote:
> >
> > On Fri, Apr 5, 2019 at 9:21 PM Ard Biesheuvel  wrote:
> > >
> > > Hi Dan,
> > >
> > > On Thu, 4 Apr 2019 at 21:21, Dan Williams  wrote:
> > > >
> > > > UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
> > > > interpretation of the EFI Memory Types as "reserved for a special
> > > > purpose".
> > > >
> > > > The proposed Linux behavior for special purpose memory is that it is
> > > > reserved for direct-access (device-dax) by default and not available for
> > > > any kernel usage, not even as an OOM fallback. Later, through udev
> > > > scripts or another init mechanism, these device-dax claimed ranges can
> > > > be reconfigured and hot-added to the available System-RAM with a unique
> > > > node identifier.
> > > >
> > > > A follow-on patch integrates parsing of the ACPI HMAT to identify the
> > > > node and sub-range boundaries of EFI_MEMORY_SP designated memory. For
> > > > now, arrange for EFI_MEMORY_SP memory to be reserved.
> > > >
> > > > Cc: Thomas Gleixner 
> > > > Cc: Ingo Molnar 
> > > > Cc: Borislav Petkov 
> > > > Cc: "H. Peter Anvin" 
> > > > Cc: Ard Biesheuvel 
> > > > Cc: Darren Hart 
> > > > Cc: Andy Shevchenko 
> > > > Signed-off-by: Dan Williams 
> > > > ---
> > > >  arch/x86/Kconfig                  |   18 ++++++++++++++++++
> > > >  arch/x86/boot/compressed/eboot.c  |    5 ++++-
> > > >  arch/x86/boot/compressed/kaslr.c  |    2 +-
> > > >  arch/x86/include/asm/e820/types.h |    9 +++++++++
> > > >  arch/x86/kernel/e820.c            |    9 +++++++--
> > > >  arch/x86/platform/efi/efi.c       |   10 +++++++++-
> > > >  include/linux/efi.h               |   14 ++++++++++++++
> > > >  include/linux/ioport.h            |    1 +
> > > >  8 files changed, 63 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > index c1f9b3cf437c..cb9ca27de7a5 100644
> > > > --- a/arch/x86/Kconfig
> > > > +++ b/arch/x86/Kconfig
> > > > @@ -1961,6 +1961,24 @@ config EFI_MIXED
> > > >
> > > >            If unsure, say N.
> > > >
> > > > +config EFI_SPECIAL_MEMORY
> > > > +       bool "EFI Special Purpose Memory Support"
> > > > +       depends on EFI
> > > > +       ---help---
> > > > +         On systems that have mixed performance classes of memory EFI
> > > > +         may indicate special purpose memory with an attribute (See
> > > > +         EFI_MEMORY_SP in UEFI 2.8). A memory range tagged with this
> > > > +         attribute may have unique performance characteristics compared
> > > > +         to the system's general purpose "System RAM" pool. On the
> > > > +         expectation that such memory has application specific usage
> > > > +         answer Y to arrange for the kernel to reserve it for
> > > > +         direct-access (device-dax) by default. The memory range can
> > > > +         later be optionally assigned to the page allocator by system
> > > > +         administrator policy. Say N to have the kernel treat this
> > > > +         memory as general purpose by default.
> > > > +
> > > > +         If unsure, say Y.
> > > > +
> > >
> > > EFI_MEMORY_SP is now part of the UEFI spec proper, so it does not make
> > > sense to make any understanding of it Kconfigurable.
> >
> > No, I think you're misunderstanding what this Kconfig option is trying
> > to achieve.
> >
> > The configuration capability is solely for the default kernel policy.
> > As can already be seen by Christoph's response [1] the thought that
> > the firmware gets more leeway to dictate to Linux memory policy may be
> > objectionable.
> >
> > [1]: https://lore.kernel.org/lkml/20190409121318.GA16955@infradead.org/
> >
> > So the Kconfig option is gating whether the kernel simply ignores the
> > attribute and gives it to the page allocator by default. Anything
> > fancier, like sub-dividing how much is OS managed vs device-dax
> > accessed requires the OS to reserve it all from the page-allocator by
> > default until userspace policy can be applied.
> >
>
> I don't think this policy should dictate whether we pretend that the
> attribute doesn't exist in the first place. We should just wire up the
> bit fully, and only apply this policy at the very end.

The bit is just a policy hint, if the kernel is not implementing any
of the policy why even check for the bit?

>
> > > Instead, what I would prefer is to implement support for EFI_MEMORY_SP
> > > unconditionally (including the ability to identify it in the debug
> > > dump of the memory map etc), in a way that all architectures can use
> > > it. Then, I think we should never treat it as ordinary memory and make
> > > it the firmware's problem not to use the EFI_MEMORY_SP attribute in
> > > cases where it results in undesired behavior in the OS.
> >
> > No, a policy of "never treat it as ordinary memory" confuses the base
> > intent of the attribute which is an optional hint to get the OS to not
> > put immovable / non-critical allocations in what could be a precious
> > resource.
> >
>
> The base intent is to prevent the OS from using memory that is
> co-located with an accelerator for any purpose other than what the
> accelerator needs it for. Having a Kconfigurable policy that may be
> disabled kind of misses the point IMO. I think 'optional hint' doesn't
> quite capture the intent.

That's not my understanding, and an EFI attribute is the wrong
mechanism to meet such a requirement. If this memory is specifically
meant for use with a given accelerator then it had better be marked
reserved and the accelerator driver is then responsible for publishing
the resource to the OS if at all.

You did prompt me to go back and re-read the wording in the spec. It
still seems clear to me that the attribute is an optional hint not a
hard requirement. Whether the OS honors an optional hint is an OS
policy and I fail to see why the OS should bother to detect the bit
without implementing any associated policy.

> > Moreover, the interface for platform firmware to indicate that a
> > memory range should never be treated as ordinary memory is simply the
> > existing "reserved" memory type, not this attribute. That's the
> > mechanism to use when platform firmware knows that a driver is needed
> > for a given mmio resource.
> >
>
> Reserved memory is memory that simply should never touched at all by
> the OS, and on ARM, we take care never to map it anywhere.

That's not a guarantee, at least on x86. Some shipping persistent
memory platforms describe it as reserved and then the ACPI NFIT
further describes what that reserved memory contains and how the OS
can use it. See commit af1996ef59db "ACPI: Change NFIT driver to
insert new resource".

Re: [RFC PATCH 1/5] efi: Detect UEFI 2.8 Special Purpose Memory

2019-04-12T20:43:56Z

On Tue, 9 Apr 2019 at 19:11, Dan Williams  wrote:
>
> On Tue, Apr 9, 2019 at 10:21 AM Ard Biesheuvel
>  wrote:
> >
> > On Tue, 9 Apr 2019 at 09:44, Dan Williams  wrote:
> > >
> > > On Fri, Apr 5, 2019 at 9:21 PM Ard Biesheuvel  wrote:
> > > >
> > > > Hi Dan,
> > > >
> > > > On Thu, 4 Apr 2019 at 21:21, Dan Williams  wrote:
> > > > >
> > > > > UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
> > > > > interpretation of the EFI Memory Types as "reserved for a special
> > > > > purpose".
> > > > >
> > > > > The proposed Linux behavior for special purpose memory is that it is
> > > > > reserved for direct-access (device-dax) by default and not available for
> > > > > any kernel usage, not even as an OOM fallback. Later, through udev
> > > > > scripts or another init mechanism, these device-dax claimed ranges can
> > > > > be reconfigured and hot-added to the available System-RAM with a unique
> > > > > node identifier.
> > > > >
> > > > > A follow-on patch integrates parsing of the ACPI HMAT to identify the
> > > > > node and sub-range boundaries of EFI_MEMORY_SP designated memory. For
> > > > > now, arrange for EFI_MEMORY_SP memory to be reserved.
> > > > >
> > > > > Cc: Thomas Gleixner 
> > > > > Cc: Ingo Molnar 
> > > > > Cc: Borislav Petkov 
> > > > > Cc: "H. Peter Anvin" 
> > > > > Cc: Ard Biesheuvel 
> > > > > Cc: Darren Hart 
> > > > > Cc: Andy Shevchenko 
> > > > > Signed-off-by: Dan Williams 
> > > > > ---
> > > > >  arch/x86/Kconfig                  |   18 ++++++++++++++++++
> > > > >  arch/x86/boot/compressed/eboot.c  |    5 ++++-
> > > > >  arch/x86/boot/compressed/kaslr.c  |    2 +-
> > > > >  arch/x86/include/asm/e820/types.h |    9 +++++++++
> > > > >  arch/x86/kernel/e820.c            |    9 +++++++--
> > > > >  arch/x86/platform/efi/efi.c       |   10 +++++++++-
> > > > >  include/linux/efi.h               |   14 ++++++++++++++
> > > > >  include/linux/ioport.h            |    1 +
> > > > >  8 files changed, 63 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > > index c1f9b3cf437c..cb9ca27de7a5 100644
> > > > > --- a/arch/x86/Kconfig
> > > > > +++ b/arch/x86/Kconfig
> > > > > @@ -1961,6 +1961,24 @@ config EFI_MIXED
> > > > >
> > > > >            If unsure, say N.
> > > > >
> > > > > +config EFI_SPECIAL_MEMORY
> > > > > +       bool "EFI Special Purpose Memory Support"
> > > > > +       depends on EFI
> > > > > +       ---help---
> > > > > +         On systems that have mixed performance classes of memory EFI
> > > > > +         may indicate special purpose memory with an attribute (See
> > > > > +         EFI_MEMORY_SP in UEFI 2.8). A memory range tagged with this
> > > > > +         attribute may have unique performance characteristics compared
> > > > > +         to the system's general purpose "System RAM" pool. On the
> > > > > +         expectation that such memory has application specific usage
> > > > > +         answer Y to arrange for the kernel to reserve it for
> > > > > +         direct-access (device-dax) by default. The memory range can
> > > > > +         later be optionally assigned to the page allocator by system
> > > > > +         administrator policy. Say N to have the kernel treat this
> > > > > +         memory as general purpose by default.
> > > > > +
> > > > > +         If unsure, say Y.
> > > > > +
> > > >
> > > > EFI_MEMORY_SP is now part of the UEFI spec proper, so it does not make
> > > > sense to make any understanding of it Kconfigurable.
> > >
> > > No, I think you're misunderstanding what this Kconfig option is trying
> > > to achieve.
> > >
> > > The configuration capability is solely for the default kernel policy.
> > > As can already be seen by Christoph's response [1] the thought that
> > > the firmware gets more leeway to dictate to Linux memory policy may be
> > > objectionable.
> > >
> > > [1]: https://lore.kernel.org/lkml/20190409121318.GA16955@infradead.org/
> > >
> > > So the Kconfig option is gating whether the kernel simply ignores the
> > > attribute and gives it to the page allocator by default. Anything
> > > fancier, like sub-dividing how much is OS managed vs device-dax
> > > accessed requires the OS to reserve it all from the page-allocator by
> > > default until userspace policy can be applied.
> > >
> >
> > I don't think this policy should dictate whether we pretend that the
> > attribute doesn't exist in the first place. We should just wire up the
> > bit fully, and only apply this policy at the very end.
>
> The bit is just a policy hint, if the kernel is not implementing any
> of the policy why even check for the bit?
>

Because I would like things like the EFI memory map dumping code etc
to report the bit regardless of whether we are honoring it or not.

> >
> > > > Instead, what I would prefer is to implement support for EFI_MEMORY_SP
> > > > unconditionally (including the ability to identify it in the debug
> > > > dump of the memory map etc), in a way that all architectures can use
> > > > it. Then, I think we should never treat it as ordinary memory and make
> > > > it the firmware's problem not to use the EFI_MEMORY_SP attribute in
> > > > cases where it results in undesired behavior in the OS.
> > >
> > > No, a policy of "never treat it as ordinary memory" confuses the base
> > > intent of the attribute which is an optional hint to get the OS to not
> > > put immovable / non-critical allocations in what could be a precious
> > > resource.
> > >
> >
> > The base intent is to prevent the OS from using memory that is
> > co-located with an accelerator for any purpose other than what the
> > accelerator needs it for. Having a Kconfigurable policy that may be
> > disabled kind of misses the point IMO. I think 'optional hint' doesn't
> > quite capture the intent.
>
> That's not my understanding, and an EFI attribute is the wrong
> mechanism to meet such a requirement. If this memory is specifically
> meant for use with a given accelerator then it had better be marked
> reserved and the accelerator driver is then responsible for publishing
> the resource to the OS if at all.
>
> You did prompt me to go back and re-read the wording in the spec. It
> still seems clear to me that the attribute is an optional hint not a
> hard requirement. Whether the OS honors an optional hint is an OS
> policy and I fail to see why the OS should bother to detect the bit
> without implementing any associated policy.
>

Because not taking a hint is not the same thing as pretending it isn't
there to begin with.

> > > Moreover, the interface for platform firmware to indicate that a
> > > memory range should never be treated as ordinary memory is simply the
> > > existing "reserved" memory type, not this attribute. That's the
> > > mechanism to use when platform firmware knows that a driver is needed
> > > for a given mmio resource.
> > >
> >
> > Reserved memory is memory that simply should never touched at all by
> > the OS, and on ARM, we take care never to map it anywhere.
>
> That's not a guarantee, at least on x86. Some shipping persistent
> memory platforms describe it as reserved and then the ACPI NFIT
> further describes what that reserved memory contains and how the OS
> can use it. See commit af1996ef59db "ACPI: Change NFIT driver to
> insert new resource".

The UEFI spec is pretty clear about the fact that reserved memory
shouldn't ever be touched by the OS. The fact that x86 platforms exist
that violate this doesn't mean we should abuse it in other ways as
well.

Re: [RFC PATCH 1/5] efi: Detect UEFI 2.8 Special Purpose Memory

2019-04-12T21:18:25Z

On Fri, Apr 12, 2019 at 1:44 PM Ard Biesheuvel
 wrote:
[..]
> > > I don't think this policy should dictate whether we pretend that the
> > > attribute doesn't exist in the first place. We should just wire up the
> > > bit fully, and only apply this policy at the very end.
> >
> > The bit is just a policy hint, if the kernel is not implementing any
> > of the policy why even check for the bit?
> >
>
> Because I would like things like the EFI memory map dumping code etc
> to report the bit regardless of whether we are honoring it or not.

Ok, I'll split it out just for reporting purposes, and come up with a
different mechanism to indicate whether the OS might not be honoring
the expectations of the attribute.

[..]
> Because not taking a hint is not the same thing as pretending it isn't
> there to begin with.

True, and I was missing the enabling to go update where the kernel
goes to report attributes, but for the applications that care they
will still want to debug when the kernel may be placing unwanted
allocations in the "special purpose" range.

> > > > Moreover, the interface for platform firmware to indicate that a
> > > > memory range should never be treated as ordinary memory is simply the
> > > > existing "reserved" memory type, not this attribute. That's the
> > > > mechanism to use when platform firmware knows that a driver is needed
> > > > for a given mmio resource.
> > > >
> > >
> > > Reserved memory is memory that simply should never touched at all by
> > > the OS, and on ARM, we take care never to map it anywhere.
> >
> > That's not a guarantee, at least on x86. Some shipping persistent
> > memory platforms describe it as reserved and then the ACPI NFIT
> > further describes what that reserved memory contains and how the OS
> > can use it. See commit af1996ef59db "ACPI: Change NFIT driver to
> > insert new resource".
>
> The UEFI spec is pretty clear about the fact that reserved memory
> shouldn't ever be touched by the OS. The fact that x86 platforms exist
> that violate this doesn't mean we should abuse it in other ways as
> well.

I think we're talking about 2 different "reserved" memory types, and
it was my fault for not being precise enough. The e820 reserved memory
type has been used for things like PCI memory-mapped I/O or other
memory ranges for which the OS should expect a device-driver to claim.
So when I said EFI_RESERVED_TYPE is safe to use as driver memory I
literally meant this interpretation from do_add_efi_memmap():

                default:
                        /*
                         * EFI_RESERVED_TYPE EFI_RUNTIME_SERVICES_CODE
                         * EFI_RUNTIME_SERVICES_DATA EFI_MEMORY_MAPPED_IO
                         * EFI_MEMORY_MAPPED_IO_PORT_SPACE EFI_PAL_CODE
                         */
                        e820_type = E820_TYPE_RESERVED;
                        break;

...where EFI_RESERVED_TYPE is identical to EFI_MEMORY_MAPPED_IO
relative to E820_TYPE_RESERVED.

The policy taken by these patches is that EFI_CONVENTIONAL_MEMORY
marked with the EFI_MEMORY_SP attribute is treated as
E820_TYPE_RESERVED by default and given to the device-dax driver with
the option to hotplug it as E820_TYPE_RAM at a later time with its own
numa description.

I'm generally pushing back on the argument that EFI_MEMORY_SP ==
EFI_RESERVED_TYPE, especially when the type is explicitly set to
EFI_CONVENTIONAL_MEMORY.

Re: [RFC PATCH 1/5] efi: Detect UEFI 2.8 Special Purpose Memory

2019-04-15T11:44:17Z

On 09.04.19 18:43, Dan Williams wrote:

>>> UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
>>> interpretation of the EFI Memory Types as "reserved for a special
>>> purpose".

What exactly is that suppose to mean ?
Little pacmans might come around and eat the bits ? :o
Who writes specs like that ? What drugs do did take ?


I vote for calling this the lore-ipsum-area ;-)


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287