* [PATCHv5 00/10] Heterogeneuos memory node attributes
@ 2019-01-24 23:07 Keith Busch
  2019-01-24 23:07 ` [PATCHv5 01/10] acpi: Create subtable parsing infrastructure Keith Busch
                   ` (12 more replies)
  0 siblings, 13 replies; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

== Changes since v4 ==

  All public interfaces have kernel docs.

  Renamed "class" to "access"; docs and change logs updated
  accordingly. (Rafael)

  The sysfs hierarchy is altered to put initiators and targets in their
  own attribute group directories (Rafael).

  The node lists are removed. This feedback is in conflict with v1
  feedback, but consensus wants to remove multi-value sysfs attributes,
  which includes lists. We only have symlinks now, just like v1 provided.

  Documentation and code patches are combined such that the code
  introducing new attributes and its documentation are in the same
  patch. (Rafael and Dan).

  The performance attributes, bandwidth and latency, are moved into the
  initiators directory. This should make it obvious for which node
  access the attributes apply, which was previously ambiguous.
  (Jonathan Cameron).

  The HMAT code selecting "local" initiators is substantially changed.
  Only PXMs that have identical performance to the HMAT's processor PXM
  in the Address Range Structure are registered. This is to avoid
  considering nodes identical when only one of several perf attributes
  is the same. (Jonathan Cameron).

  Verbose variable naming. Examples include "initiator" and "target"
  instead of "i" and "t", "mem_pxm" and "cpu_pxm" instead of "m" and
  "p". (Rafael)

  Compile fixes for when HMEM_REPORTING is not set. This is not a
  user-selectable config option; it defaults to 'n' and has to be
  selected by other config options that require it (Greg KH and Rafael).

== Background ==

Platforms may provide multiple types of CPU-attached system memory. The
memory ranges for each type may have different characteristics that
applications may wish to know about when considering what node they want
their memory allocated from.

It had previously been difficult to describe these setups as memory
ranges were generally lumped into the NUMA node of the CPUs. New
platform attributes have been created and are in use today that describe
the more complex memory hierarchies that can be created.

This series' objective is to provide the attributes from such systems
that are useful for applications to know about, and readily usable with
existing tools and libraries.

Keith Busch (10):
  acpi: Create subtable parsing infrastructure
  acpi: Add HMAT to generic parsing tables
  acpi/hmat: Parse and report heterogeneous memory
  node: Link memory nodes to their compute nodes
  acpi/hmat: Register processor domain to its memory
  node: Add heterogenous memory access attributes
  acpi/hmat: Register performance attributes
  node: Add memory caching attributes
  acpi/hmat: Register memory side cache attributes
  doc/mm: New documentation for memory performance

 Documentation/ABI/stable/sysfs-devices-node   |  87 ++++-
 Documentation/admin-guide/mm/numaperf.rst     | 167 ++++++++
 arch/arm64/kernel/acpi_numa.c                 |   2 +-
 arch/arm64/kernel/smp.c                       |   4 +-
 arch/ia64/kernel/acpi.c                       |  12 +-
 arch/x86/kernel/acpi/boot.c                   |  36 +-
 drivers/acpi/Kconfig                          |   1 +
 drivers/acpi/Makefile                         |   1 +
 drivers/acpi/hmat/Kconfig                     |   9 +
 drivers/acpi/hmat/Makefile                    |   1 +
 drivers/acpi/hmat/hmat.c                      | 537 ++++++++++++++++++++++++++
 drivers/acpi/numa.c                           |  16 +-
 drivers/acpi/scan.c                           |   4 +-
 drivers/acpi/tables.c                         |  76 +++-
 drivers/base/Kconfig                          |   8 +
 drivers/base/node.c                           | 354 ++++++++++++++++-
 drivers/irqchip/irq-gic-v2m.c                 |   2 +-
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
 drivers/irqchip/irq-gic-v3-its.c              |   6 +-
 drivers/irqchip/irq-gic-v3.c                  |  10 +-
 drivers/irqchip/irq-gic.c                     |   4 +-
 drivers/mailbox/pcc.c                         |   2 +-
 include/linux/acpi.h                          |   6 +-
 include/linux/node.h                          |  60 ++-
 25 files changed, 1344 insertions(+), 65 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/numaperf.rst
 create mode 100644 drivers/acpi/hmat/Kconfig
 create mode 100644 drivers/acpi/hmat/Makefile
 create mode 100644 drivers/acpi/hmat/hmat.c

-- 
2.14.4


* [PATCHv5 01/10] acpi: Create subtable parsing infrastructure
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-01-24 23:07 ` [PATCHv5 02/10] acpi: Add HMAT to generic parsing tables Keith Busch
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Parsing entries in an ACPI table had assumed a generic header
structure. There is no standard ACPI header, though, so less common
layouts with different field sizes required custom parsers to go through
their subtable entry list.

Create the infrastructure for adding different table types so that the
code parsing the entries array can be reused for all ACPI system tables
and the common code doesn't need to be duplicated.
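
For readers skimming the diff, the idea can be sketched outside the kernel
roughly as follows. This is an illustrative userspace approximation, not the
code from this patch: the names subtable_header, subtable_entry,
load_header() and count_matching() are made up for the example, and the
header is copied out with memcpy() rather than accessed through a cast
pointer as the kernel does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct acpi_subtable_header. */
struct subtable_header {
	uint8_t type;
	uint8_t length;		/* total entry length, header included */
};

/* One member per known header layout; only the common one exists here. */
union subtable_headers {
	struct subtable_header common;
};

enum subtable_type {
	SUBTABLE_COMMON,
};

struct subtable_entry {
	const uint8_t *ptr;		/* start of the current entry */
	enum subtable_type type;	/* which union member describes it */
};

static size_t header_length(const struct subtable_entry *e)
{
	switch (e->type) {
	case SUBTABLE_COMMON:
		return sizeof(struct subtable_header);
	}
	return 0;
}

/* Copy the header out of the raw buffer (hypothetical helper). */
static union subtable_headers load_header(const struct subtable_entry *e)
{
	union subtable_headers h;

	memset(&h, 0, sizeof(h));
	memcpy(&h, e->ptr, header_length(e));
	return h;
}

static unsigned long entry_type(const struct subtable_entry *e)
{
	union subtable_headers h = load_header(e);

	switch (e->type) {
	case SUBTABLE_COMMON:
		return h.common.type;
	}
	return 0;
}

static unsigned long entry_length(const struct subtable_entry *e)
{
	union subtable_headers h = load_header(e);

	switch (e->type) {
	case SUBTABLE_COMMON:
		return h.common.length;
	}
	return 0;
}

/*
 * Count entries whose type equals `want`. Returns -1 on a zero-length
 * entry, which would otherwise never advance.
 */
static int count_matching(const uint8_t *buf, size_t size, unsigned long want)
{
	struct subtable_entry e = { .ptr = buf, .type = SUBTABLE_COMMON };
	size_t hdr_len = header_length(&e);
	size_t off = 0;
	int count = 0;

	/* Same bound as the patched loop: header must end before the table does. */
	while (off + hdr_len < size) {
		unsigned long len;

		e.ptr = buf + off;
		len = entry_length(&e);
		if (len == 0)
			return -1;
		if (entry_type(&e) == want)
			count++;
		off += len;
	}
	return count;
}
```

The point of the indirection is that the walking loop never touches a header
field directly, so a layout with different field sizes only needs a new enum
value and a new case in each accessor.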

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 arch/arm64/kernel/acpi_numa.c                 |  2 +-
 arch/arm64/kernel/smp.c                       |  4 +-
 arch/ia64/kernel/acpi.c                       | 12 ++---
 arch/x86/kernel/acpi/boot.c                   | 36 +++++++-------
 drivers/acpi/numa.c                           | 16 +++----
 drivers/acpi/scan.c                           |  4 +-
 drivers/acpi/tables.c                         | 67 +++++++++++++++++++++++----
 drivers/irqchip/irq-gic-v2m.c                 |  2 +-
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |  2 +-
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |  2 +-
 drivers/irqchip/irq-gic-v3-its.c              |  6 +--
 drivers/irqchip/irq-gic-v3.c                  | 10 ++--
 drivers/irqchip/irq-gic.c                     |  4 +-
 drivers/mailbox/pcc.c                         |  2 +-
 include/linux/acpi.h                          |  5 +-
 15 files changed, 112 insertions(+), 62 deletions(-)

diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index eac1d0cc595c..7ff800045434 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -45,7 +45,7 @@ static inline int get_cpu_for_acpi_id(u32 uid)
 	return -EINVAL;
 }
 
-static int __init acpi_parse_gicc_pxm(struct acpi_subtable_header *header,
+static int __init acpi_parse_gicc_pxm(union acpi_subtable_headers *header,
 				      const unsigned long end)
 {
 	struct acpi_srat_gicc_affinity *pa;
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 1598d6f7200a..e6a148604dcc 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -553,7 +553,7 @@ acpi_map_gic_cpu_interface(struct acpi_madt_generic_interrupt *processor)
 }
 
 static int __init
-acpi_parse_gic_cpu_interface(struct acpi_subtable_header *header,
+acpi_parse_gic_cpu_interface(union acpi_subtable_headers *header,
 			     const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *processor;
@@ -562,7 +562,7 @@ acpi_parse_gic_cpu_interface(struct acpi_subtable_header *header,
 	if (BAD_MADT_GICC_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	acpi_map_gic_cpu_interface(processor);
 
diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 41eb281709da..3973d2c2a9b0 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -177,7 +177,7 @@ struct acpi_table_madt *acpi_madt __initdata;
 static u8 has_8259;
 
 static int __init
-acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
+acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header,
 			  const unsigned long end)
 {
 	struct acpi_madt_local_apic_override *lapic;
@@ -216,7 +216,7 @@ acpi_parse_lsapic(struct acpi_subtable_header * header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_local_apic_nmi *lacpi_nmi;
 
@@ -230,7 +230,7 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e
 }
 
 static int __init
-acpi_parse_iosapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_iosapic(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_io_sapic *iosapic;
 
@@ -245,7 +245,7 @@ acpi_parse_iosapic(struct acpi_subtable_header * header, const unsigned long end
 static unsigned int __initdata acpi_madt_rev;
 
 static int __init
-acpi_parse_plat_int_src(struct acpi_subtable_header * header,
+acpi_parse_plat_int_src(union acpi_subtable_headers * header,
 			const unsigned long end)
 {
 	struct acpi_madt_interrupt_source *plintsrc;
@@ -329,7 +329,7 @@ unsigned int get_cpei_target_cpu(void)
 }
 
 static int __init
-acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
+acpi_parse_int_src_ovr(union acpi_subtable_headers * header,
 		       const unsigned long end)
 {
 	struct acpi_madt_interrupt_override *p;
@@ -350,7 +350,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
 }
 
 static int __init
-acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_nmi_source *nmi_src;
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 2624de16cd7a..b694a32f95d4 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -197,7 +197,7 @@ static int acpi_register_lapic(int id, u32 acpiid, u8 enabled)
 }
 
 static int __init
-acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
 {
 	struct acpi_madt_local_x2apic *processor = NULL;
 #ifdef CONFIG_X86_X2APIC
@@ -210,7 +210,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 #ifdef CONFIG_X86_X2APIC
 	apic_id = processor->local_apic_id;
@@ -242,7 +242,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_local_apic *processor = NULL;
 
@@ -251,7 +251,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	/* Ignore invalid ID */
 	if (processor->id == 0xff)
@@ -272,7 +272,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_sapic(union acpi_subtable_headers *header, const unsigned long end)
 {
 	struct acpi_madt_local_sapic *processor = NULL;
 
@@ -281,7 +281,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	acpi_register_lapic((processor->id << 8) | processor->eid,/* APIC ID */
 			    processor->processor_id, /* ACPI ID */
@@ -291,7 +291,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
+acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header,
 			  const unsigned long end)
 {
 	struct acpi_madt_local_apic_override *lapic_addr_ovr = NULL;
@@ -301,7 +301,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
 	if (BAD_MADT_ENTRY(lapic_addr_ovr, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	acpi_lapic_addr = lapic_addr_ovr->address;
 
@@ -309,7 +309,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
 }
 
 static int __init
-acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
+acpi_parse_x2apic_nmi(union acpi_subtable_headers *header,
 		      const unsigned long end)
 {
 	struct acpi_madt_local_x2apic_nmi *x2apic_nmi = NULL;
@@ -319,7 +319,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
 	if (BAD_MADT_ENTRY(x2apic_nmi, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	if (x2apic_nmi->lint != 1)
 		printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
@@ -328,7 +328,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
 }
 
 static int __init
-acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_local_apic_nmi *lapic_nmi = NULL;
 
@@ -337,7 +337,7 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e
 	if (BAD_MADT_ENTRY(lapic_nmi, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	if (lapic_nmi->lint != 1)
 		printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
@@ -449,7 +449,7 @@ static int __init mp_register_ioapic_irq(u8 bus_irq, u8 polarity,
 }
 
 static int __init
-acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_ioapic(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_io_apic *ioapic = NULL;
 	struct ioapic_domain_cfg cfg = {
@@ -462,7 +462,7 @@ acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end)
 	if (BAD_MADT_ENTRY(ioapic, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	/* Statically assign IRQ numbers for IOAPICs hosting legacy IRQs */
 	if (ioapic->global_irq_base < nr_legacy_irqs())
@@ -508,7 +508,7 @@ static void __init acpi_sci_ioapic_setup(u8 bus_irq, u16 polarity, u16 trigger,
 }
 
 static int __init
-acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
+acpi_parse_int_src_ovr(union acpi_subtable_headers * header,
 		       const unsigned long end)
 {
 	struct acpi_madt_interrupt_override *intsrc = NULL;
@@ -518,7 +518,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
 	if (BAD_MADT_ENTRY(intsrc, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	if (intsrc->source_irq == acpi_gbl_FADT.sci_interrupt) {
 		acpi_sci_ioapic_setup(intsrc->source_irq,
@@ -550,7 +550,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
 }
 
 static int __init
-acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_nmi_source *nmi_src = NULL;
 
@@ -559,7 +559,7 @@ acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end
 	if (BAD_MADT_ENTRY(nmi_src, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	/* TBD: Support nimsrc entries? */
 
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 7bbbf8256a41..d6433b14864c 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -338,7 +338,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
 }
 
 static int __init
-acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
+acpi_parse_x2apic_affinity(union acpi_subtable_headers *header,
 			   const unsigned long end)
 {
 	struct acpi_srat_x2apic_cpu_affinity *processor_affinity;
@@ -347,7 +347,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
 	if (!processor_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	acpi_numa_x2apic_affinity_init(processor_affinity);
@@ -356,7 +356,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
 }
 
 static int __init
-acpi_parse_processor_affinity(struct acpi_subtable_header *header,
+acpi_parse_processor_affinity(union acpi_subtable_headers *header,
 			      const unsigned long end)
 {
 	struct acpi_srat_cpu_affinity *processor_affinity;
@@ -365,7 +365,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header,
 	if (!processor_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	acpi_numa_processor_affinity_init(processor_affinity);
@@ -374,7 +374,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header,
 }
 
 static int __init
-acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
+acpi_parse_gicc_affinity(union acpi_subtable_headers *header,
 			 const unsigned long end)
 {
 	struct acpi_srat_gicc_affinity *processor_affinity;
@@ -383,7 +383,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
 	if (!processor_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	acpi_numa_gicc_affinity_init(processor_affinity);
@@ -394,7 +394,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
 static int __initdata parsed_numa_memblks;
 
 static int __init
-acpi_parse_memory_affinity(struct acpi_subtable_header * header,
+acpi_parse_memory_affinity(union acpi_subtable_headers * header,
 			   const unsigned long end)
 {
 	struct acpi_srat_mem_affinity *memory_affinity;
@@ -403,7 +403,7 @@ acpi_parse_memory_affinity(struct acpi_subtable_header * header,
 	if (!memory_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	if (!acpi_numa_memory_affinity_init(memory_affinity))
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 5efd4219f112..a894ce3556f2 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -2240,10 +2240,10 @@ static struct acpi_probe_entry *ape;
 static int acpi_probe_count;
 static DEFINE_MUTEX(acpi_probe_mutex);
 
-static int __init acpi_match_madt(struct acpi_subtable_header *header,
+static int __init acpi_match_madt(union acpi_subtable_headers *header,
 				  const unsigned long end)
 {
-	if (!ape->subtable_valid || ape->subtable_valid(header, ape))
+	if (!ape->subtable_valid || ape->subtable_valid(&header->common, ape))
 		if (!ape->probe_subtbl(header, end))
 			acpi_probe_count++;
 
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 48eabb6c2d4f..967e1168becf 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -49,6 +49,15 @@ static struct acpi_table_desc initial_tables[ACPI_MAX_TABLES] __initdata;
 
 static int acpi_apic_instance __initdata;
 
+enum acpi_subtable_type {
+	ACPI_SUBTABLE_COMMON,
+};
+
+struct acpi_subtable_entry {
+	union acpi_subtable_headers *hdr;
+	enum acpi_subtable_type type;
+};
+
 /*
  * Disable table checksum verification for the early stage due to the size
  * limitation of the current x86 early mapping implementation.
@@ -217,6 +226,42 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
 	}
 }
 
+static unsigned long __init
+acpi_get_entry_type(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return entry->hdr->common.type;
+	}
+	return 0;
+}
+
+static unsigned long __init
+acpi_get_entry_length(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return entry->hdr->common.length;
+	}
+	return 0;
+}
+
+static unsigned long __init
+acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return sizeof(entry->hdr->common);
+	}
+	return 0;
+}
+
+static enum acpi_subtable_type __init
+acpi_get_subtable_type(char *id)
+{
+	return ACPI_SUBTABLE_COMMON;
+}
+
 /**
  * acpi_parse_entries_array - for each proc_num find a suitable subtable
  *
@@ -246,8 +291,8 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 		struct acpi_subtable_proc *proc, int proc_num,
 		unsigned int max_entries)
 {
-	struct acpi_subtable_header *entry;
-	unsigned long table_end;
+	struct acpi_subtable_entry entry;
+	unsigned long table_end, subtable_len, entry_len;
 	int count = 0;
 	int errs = 0;
 	int i;
@@ -270,19 +315,20 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 
 	/* Parse all entries looking for a match. */
 
-	entry = (struct acpi_subtable_header *)
+	entry.type = acpi_get_subtable_type(id);
+	entry.hdr = (union acpi_subtable_headers *)
 	    ((unsigned long)table_header + table_size);
+	subtable_len = acpi_get_subtable_header_length(&entry);
 
-	while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
-	       table_end) {
+	while (((unsigned long)entry.hdr) + subtable_len  < table_end) {
 		if (max_entries && count >= max_entries)
 			break;
 
 		for (i = 0; i < proc_num; i++) {
-			if (entry->type != proc[i].id)
+			if (acpi_get_entry_type(&entry) != proc[i].id)
 				continue;
 			if (!proc[i].handler ||
-			     (!errs && proc[i].handler(entry, table_end))) {
+			     (!errs && proc[i].handler(entry.hdr, table_end))) {
 				errs++;
 				continue;
 			}
@@ -297,13 +343,14 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 		 * If entry->length is 0, break from this loop to avoid
 		 * infinite loop.
 		 */
-		if (entry->length == 0) {
+		entry_len = acpi_get_entry_length(&entry);
+		if (entry_len == 0) {
 			pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id);
 			return -EINVAL;
 		}
 
-		entry = (struct acpi_subtable_header *)
-		    ((unsigned long)entry + entry->length);
+		entry.hdr = (union acpi_subtable_headers *)
+		    ((unsigned long)entry.hdr + entry_len);
 	}
 
 	if (max_entries && count > max_entries) {
diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
index f5fe0100f9ff..de14e06fd9ec 100644
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -446,7 +446,7 @@ static struct fwnode_handle *gicv2m_get_fwnode(struct device *dev)
 }
 
 static int __init
-acpi_parse_madt_msi(struct acpi_subtable_header *header,
+acpi_parse_madt_msi(union acpi_subtable_headers *header,
 		    const unsigned long end)
 {
 	int ret;
diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
index 8d6d009d1d58..c81d5b81da56 100644
--- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
@@ -159,7 +159,7 @@ static int __init its_pci_of_msi_init(void)
 #ifdef CONFIG_ACPI
 
 static int __init
-its_pci_msi_parse_madt(struct acpi_subtable_header *header,
+its_pci_msi_parse_madt(union acpi_subtable_headers *header,
 		       const unsigned long end)
 {
 	struct acpi_madt_generic_translator *its_entry;
diff --git a/drivers/irqchip/irq-gic-v3-its-platform-msi.c b/drivers/irqchip/irq-gic-v3-its-platform-msi.c
index 7b8e87b493fe..9cdcda5bb3bd 100644
--- a/drivers/irqchip/irq-gic-v3-its-platform-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-platform-msi.c
@@ -117,7 +117,7 @@ static int __init its_pmsi_init_one(struct fwnode_handle *fwnode,
 
 #ifdef CONFIG_ACPI
 static int __init
-its_pmsi_parse_madt(struct acpi_subtable_header *header,
+its_pmsi_parse_madt(union acpi_subtable_headers *header,
 			const unsigned long end)
 {
 	struct acpi_madt_generic_translator *its_entry;
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index db20e992a40f..d6677075d68f 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -3764,13 +3764,13 @@ static int __init acpi_get_its_numa_node(u32 its_id)
 	return NUMA_NO_NODE;
 }
 
-static int __init gic_acpi_match_srat_its(struct acpi_subtable_header *header,
+static int __init gic_acpi_match_srat_its(union acpi_subtable_headers *header,
 					  const unsigned long end)
 {
 	return 0;
 }
 
-static int __init gic_acpi_parse_srat_its(struct acpi_subtable_header *header,
+static int __init gic_acpi_parse_srat_its(union acpi_subtable_headers *header,
 			 const unsigned long end)
 {
 	int node;
@@ -3837,7 +3837,7 @@ static int __init acpi_get_its_numa_node(u32 its_id) { return NUMA_NO_NODE; }
 static void __init acpi_its_srat_maps_free(void) { }
 #endif
 
-static int __init gic_acpi_parse_madt_its(struct acpi_subtable_header *header,
+static int __init gic_acpi_parse_madt_its(union acpi_subtable_headers *header,
 					  const unsigned long end)
 {
 	struct acpi_madt_generic_translator *its_entry;
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 0868a9d81c3c..44db6b809c52 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1392,7 +1392,7 @@ gic_acpi_register_redist(phys_addr_t phys_base, void __iomem *redist_base)
 }
 
 static int __init
-gic_acpi_parse_madt_redist(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_redist(union acpi_subtable_headers *header,
 			   const unsigned long end)
 {
 	struct acpi_madt_generic_redistributor *redist =
@@ -1410,7 +1410,7 @@ gic_acpi_parse_madt_redist(struct acpi_subtable_header *header,
 }
 
 static int __init
-gic_acpi_parse_madt_gicc(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_gicc(union acpi_subtable_headers *header,
 			 const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *gicc =
@@ -1452,14 +1452,14 @@ static int __init gic_acpi_collect_gicr_base(void)
 	return -ENODEV;
 }
 
-static int __init gic_acpi_match_gicr(struct acpi_subtable_header *header,
+static int __init gic_acpi_match_gicr(union acpi_subtable_headers *header,
 				  const unsigned long end)
 {
 	/* Subtable presence means that redist exists, that's it */
 	return 0;
 }
 
-static int __init gic_acpi_match_gicc(struct acpi_subtable_header *header,
+static int __init gic_acpi_match_gicc(union acpi_subtable_headers *header,
 				      const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *gicc =
@@ -1525,7 +1525,7 @@ static bool __init acpi_validate_gic_table(struct acpi_subtable_header *header,
 	return true;
 }
 
-static int __init gic_acpi_parse_virt_madt_gicc(struct acpi_subtable_header *header,
+static int __init gic_acpi_parse_virt_madt_gicc(union acpi_subtable_headers *header,
 						const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *gicc =
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index ba2a37a27a54..a749d73f8337 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -1508,7 +1508,7 @@ static struct
 } acpi_data __initdata;
 
 static int __init
-gic_acpi_parse_madt_cpu(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_cpu(union acpi_subtable_headers *header,
 			const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *processor;
@@ -1540,7 +1540,7 @@ gic_acpi_parse_madt_cpu(struct acpi_subtable_header *header,
 }
 
 /* The things you have to do to just *count* something... */
-static int __init acpi_dummy_func(struct acpi_subtable_header *header,
+static int __init acpi_dummy_func(union acpi_subtable_headers *header,
 				  const unsigned long end)
 {
 	return 0;
diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
index 256f18b67e8a..08a0a3517138 100644
--- a/drivers/mailbox/pcc.c
+++ b/drivers/mailbox/pcc.c
@@ -382,7 +382,7 @@ static const struct mbox_chan_ops pcc_chan_ops = {
  *
  * This gets called for each entry in the PCC table.
  */
-static int parse_pcc_subspace(struct acpi_subtable_header *header,
+static int parse_pcc_subspace(union acpi_subtable_headers *header,
 		const unsigned long end)
 {
 	struct acpi_pcct_subspace *ss = (struct acpi_pcct_subspace *) header;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 87715f20b69a..7c3c4ebaded6 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -141,10 +141,13 @@ enum acpi_address_range_id {
 
 
 /* Table Handlers */
+union acpi_subtable_headers {
+	struct acpi_subtable_header common;
+};
 
 typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
 
-typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
+typedef int (*acpi_tbl_entry_handler)(union acpi_subtable_headers *header,
 				      const unsigned long end);
 
 /* Debugger support */
-- 
2.14.4


* [PATCHv5 02/10] acpi: Add HMAT to generic parsing tables
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
  2019-01-24 23:07 ` [PATCHv5 01/10] acpi: Create subtable parsing infrastructure Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-01-24 23:07 ` [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory Keith Busch
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

The Heterogeneous Memory Attribute Table (HMAT) subtable header uses
different field lengths than the existing parsing code assumes. Add the
HMAT type to the parsing rules so it may be generically parsed.
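
To make the size difference concrete, here is a small userspace sketch. It
assumes the HMAT subtable header layout from ACPICA's struct
acpi_hmat_structure (16-bit type, 16-bit reserved, 32-bit length) next to the
2-byte common header. The names common_header, hmat_header,
subtable_type_for() and entry_length() are invented for the example and are
not kernel interfaces.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Common layout: 1-byte type, 1-byte length (2 bytes total). */
struct common_header {
	uint8_t type;
	uint8_t length;
};

/* HMAT layout, assumed to mirror struct acpi_hmat_structure (8 bytes). */
struct hmat_header {
	uint16_t type;
	uint16_t reserved;
	uint32_t length;
};

union subtable_headers {
	struct common_header common;
	struct hmat_header hmat;
};

enum subtable_type {
	SUBTABLE_COMMON,
	SUBTABLE_HMAT,
};

/* Pick the layout from the 4-character table signature. */
static enum subtable_type subtable_type_for(const char *sig)
{
	if (strncmp(sig, "HMAT", 4) == 0)
		return SUBTABLE_HMAT;
	return SUBTABLE_COMMON;
}

static size_t header_length(enum subtable_type t)
{
	switch (t) {
	case SUBTABLE_COMMON:
		return sizeof(struct common_header);
	case SUBTABLE_HMAT:
		return sizeof(struct hmat_header);
	}
	return 0;
}

/* Read the entry length using the layout selected for this table. */
static unsigned long entry_length(enum subtable_type t, const uint8_t *entry)
{
	union subtable_headers h;

	memset(&h, 0, sizeof(h));
	memcpy(&h, entry, header_length(t));

	switch (t) {
	case SUBTABLE_COMMON:
		return h.common.length;
	case SUBTABLE_HMAT:
		return h.hmat.length;
	}
	return 0;
}
```

With a 32-bit length field an HMAT entry can be longer than 255 bytes, which
the one-byte common length field cannot express; that is why the generic
walker needs to know which layout it is looking at.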

Cc: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/tables.c | 9 +++++++++
 include/linux/acpi.h  | 1 +
 2 files changed, 10 insertions(+)

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 967e1168becf..d9911cd55edc 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -51,6 +51,7 @@ static int acpi_apic_instance __initdata;
 
 enum acpi_subtable_type {
 	ACPI_SUBTABLE_COMMON,
+	ACPI_SUBTABLE_HMAT,
 };
 
 struct acpi_subtable_entry {
@@ -232,6 +233,8 @@ acpi_get_entry_type(struct acpi_subtable_entry *entry)
 	switch (entry->type) {
 	case ACPI_SUBTABLE_COMMON:
 		return entry->hdr->common.type;
+	case ACPI_SUBTABLE_HMAT:
+		return entry->hdr->hmat.type;
 	}
 	return 0;
 }
@@ -242,6 +245,8 @@ acpi_get_entry_length(struct acpi_subtable_entry *entry)
 	switch (entry->type) {
 	case ACPI_SUBTABLE_COMMON:
 		return entry->hdr->common.length;
+	case ACPI_SUBTABLE_HMAT:
+		return entry->hdr->hmat.length;
 	}
 	return 0;
 }
@@ -252,6 +257,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
 	switch (entry->type) {
 	case ACPI_SUBTABLE_COMMON:
 		return sizeof(entry->hdr->common);
+	case ACPI_SUBTABLE_HMAT:
+		return sizeof(entry->hdr->hmat);
 	}
 	return 0;
 }
@@ -259,6 +266,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
 static enum acpi_subtable_type __init
 acpi_get_subtable_type(char *id)
 {
+	if (strncmp(id, ACPI_SIG_HMAT, 4) == 0)
+		return ACPI_SUBTABLE_HMAT;
 	return ACPI_SUBTABLE_COMMON;
 }
 
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 7c3c4ebaded6..53f93dff171c 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -143,6 +143,7 @@ enum acpi_address_range_id {
 /* Table Handlers */
 union acpi_subtable_headers {
 	struct acpi_subtable_header common;
+	struct acpi_hmat_structure hmat;
 };
 
 typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
-- 
2.14.4

* [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
  2019-01-24 23:07 ` [PATCHv5 01/10] acpi: Create subtable parsing infrastructure Keith Busch
  2019-01-24 23:07 ` [PATCHv5 02/10] acpi: Add HMAT to generic parsing tables Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-02-05 12:12     ` Rafael J. Wysocki
  2019-02-06 12:28     ` Jonathan Cameron
  2019-01-24 23:07 ` [PATCHv5 04/10] node: Link memory nodes to their compute nodes Keith Busch
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Systems may provide different memory types and describe them in the ACPI
Heterogeneous Memory Attribute Table (HMAT). Parse the tables provided
by the platform and report the memory access and caching attributes in
the kernel log.
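
For readers unfamiliar with the locality structure, here is a minimal
user-space sketch of how one matrix entry is decoded. It assumes the row-major
layout and the "entry * base unit / 10" scaling used by hmat_parse_locality()
in this patch. The function name is hypothetical, and the arithmetic is done
in 64 bits here for simplicity, whereas the patch stores the result in a u32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one entry of a locality matrix laid out row-major as
 * entries[initiator * num_targets + target]. The scaling mirrors the
 * patch: raw entry times the base unit, divided by 10.
 */
static uint64_t locality_value(const uint16_t *entries,
			       unsigned int num_targets,
			       unsigned int initiator, unsigned int target,
			       uint64_t base_unit)
{
	assert(entries != NULL && target < num_targets);
	return ((uint64_t)entries[initiator * num_targets + target] *
		base_unit) / 10;
}
```

With a two-target matrix { 10, 20, 30, 40 } and a base unit of 100, the entry
for initiator 1 and target 0 is 30, which decodes to 300.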

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/Kconfig       |   1 +
 drivers/acpi/Makefile      |   1 +
 drivers/acpi/hmat/Kconfig  |   8 ++
 drivers/acpi/hmat/Makefile |   1 +
 drivers/acpi/hmat/hmat.c   | 181 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 192 insertions(+)
 create mode 100644 drivers/acpi/hmat/Kconfig
 create mode 100644 drivers/acpi/hmat/Makefile
 create mode 100644 drivers/acpi/hmat/hmat.c

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 90ff0a47c12e..b377f970adfd 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -465,6 +465,7 @@ config ACPI_REDUCED_HARDWARE_ONLY
 	  If you are unsure what to do, do not enable this option.
 
 source "drivers/acpi/nfit/Kconfig"
+source "drivers/acpi/hmat/Kconfig"
 
 source "drivers/acpi/apei/Kconfig"
 source "drivers/acpi/dptf/Kconfig"
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index bb857421c2e8..5d361e4e3405 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -80,6 +80,7 @@ obj-$(CONFIG_ACPI_PROCESSOR)	+= processor.o
 obj-$(CONFIG_ACPI)		+= container.o
 obj-$(CONFIG_ACPI_THERMAL)	+= thermal.o
 obj-$(CONFIG_ACPI_NFIT)		+= nfit/
+obj-$(CONFIG_ACPI_HMAT)		+= hmat/
 obj-$(CONFIG_ACPI)		+= acpi_memhotplug.o
 obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
 obj-$(CONFIG_ACPI_BATTERY)	+= battery.o
diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
new file mode 100644
index 000000000000..c9637e2e7514
--- /dev/null
+++ b/drivers/acpi/hmat/Kconfig
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+config ACPI_HMAT
+	bool "ACPI Heterogeneous Memory Attribute Table Support"
+	depends on ACPI_NUMA
+	help
+	 If set, this option causes the kernel to set the memory NUMA node
+	 relationships and access attributes in accordance with ACPI HMAT
+	 (Heterogeneous Memory Attribute Table).
diff --git a/drivers/acpi/hmat/Makefile b/drivers/acpi/hmat/Makefile
new file mode 100644
index 000000000000..e909051d3d00
--- /dev/null
+++ b/drivers/acpi/hmat/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_ACPI_HMAT) := hmat.o
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
new file mode 100644
index 000000000000..1741bf30d87f
--- /dev/null
+++ b/drivers/acpi/hmat/hmat.c
@@ -0,0 +1,181 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Intel Corporation.
+ *
+ * Heterogeneous Memory Attribute Table (HMAT) representation
+ *
+ * This program parses and reports the platform's HMAT tables, and registers
+ * the applicable attributes with the node's interfaces.
+ */
+
+#include <linux/acpi.h>
+#include <linux/bitops.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/node.h>
+#include <linux/sysfs.h>
+
+static __init const char *hmat_data_type(u8 type)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+		return "Access Latency";
+	case ACPI_HMAT_READ_LATENCY:
+		return "Read Latency";
+	case ACPI_HMAT_WRITE_LATENCY:
+		return "Write Latency";
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+		return "Access Bandwidth";
+	case ACPI_HMAT_READ_BANDWIDTH:
+		return "Read Bandwidth";
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		return "Write Bandwidth";
+	default:
+		return "Reserved";
+	};
+}
+
+static __init const char *hmat_data_type_suffix(u8 type)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+	case ACPI_HMAT_READ_LATENCY:
+	case ACPI_HMAT_WRITE_LATENCY:
+		return " nsec";
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+	case ACPI_HMAT_READ_BANDWIDTH:
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		return " MB/s";
+	default:
+		return "";
+	};
+}
+
+static __init int hmat_parse_locality(union acpi_subtable_headers *header,
+				      const unsigned long end)
+{
+	struct acpi_hmat_locality *hmat_loc = (void *)header;
+	unsigned int init, targ, total_size, ipds, tpds;
+	u32 *inits, *targs, value;
+	u16 *entries;
+	u8 type;
+
+	if (hmat_loc->header.length < sizeof(*hmat_loc)) {
+		pr_debug("HMAT: Unexpected locality header length: %d\n",
+			 hmat_loc->header.length);
+		return -EINVAL;
+	}
+
+	type = hmat_loc->data_type;
+	ipds = hmat_loc->number_of_initiator_Pds;
+	tpds = hmat_loc->number_of_target_Pds;
+	total_size = sizeof(*hmat_loc) + sizeof(*entries) * ipds * tpds +
+		     sizeof(*inits) * ipds + sizeof(*targs) * tpds;
+	if (hmat_loc->header.length < total_size) {
+		pr_debug("HMAT: Unexpected locality header length:%d, minimum required:%d\n",
+			 hmat_loc->header.length, total_size);
+		return -EINVAL;
+	}
+
+	pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n",
+		hmat_loc->flags, hmat_data_type(type), ipds, tpds,
+		hmat_loc->entry_base_unit);
+
+	inits = (u32 *)(hmat_loc + 1);
+	targs = &inits[ipds];
+	entries = (u16 *)(&targs[tpds]);
+	for (init = 0; init < ipds; init++) {
+		for (targ = 0; targ < tpds; targ++) {
+			value = entries[init * tpds + targ];
+			value = (value * hmat_loc->entry_base_unit) / 10;
+			pr_info("  Initiator-Target[%d-%d]:%d%s\n",
+				inits[init], targs[targ], value,
+				hmat_data_type_suffix(type));
+		}
+	}
+
+	return 0;
+}
+
+static __init int hmat_parse_cache(union acpi_subtable_headers *header,
+				   const unsigned long end)
+{
+	struct acpi_hmat_cache *cache = (void *)header;
+	u32 attrs;
+
+	if (cache->header.length < sizeof(*cache)) {
+		pr_debug("HMAT: Unexpected cache header length: %d\n",
+			 cache->header.length);
+		return -EINVAL;
+	}
+
+	attrs = cache->cache_attributes;
+	pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n",
+		cache->memory_PD, cache->cache_size, attrs,
+		cache->number_of_SMBIOShandles);
+
+	return 0;
+}
+
+static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
+					   const unsigned long end)
+{
+	struct acpi_hmat_address_range *spa = (void *)header;
+
+	if (spa->header.length != sizeof(*spa)) {
+		pr_debug("HMAT: Unexpected address range header length: %d\n",
+			 spa->header.length);
+		return -EINVAL;
+	}
+	pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
+		spa->physical_address_base, spa->physical_address_length,
+		spa->flags, spa->processor_PD, spa->memory_PD);
+
+	return 0;
+}
+
+static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
+				      const unsigned long end)
+{
+	struct acpi_hmat_structure *hdr = (void *)header;
+
+	if (!hdr)
+		return -EINVAL;
+
+	switch (hdr->type) {
+	case ACPI_HMAT_TYPE_ADDRESS_RANGE:
+		return hmat_parse_address_range(header, end);
+	case ACPI_HMAT_TYPE_LOCALITY:
+		return hmat_parse_locality(header, end);
+	case ACPI_HMAT_TYPE_CACHE:
+		return hmat_parse_cache(header, end);
+	default:
+		return -EINVAL;
+	}
+}
+
+static __init int hmat_init(void)
+{
+	struct acpi_table_header *tbl;
+	enum acpi_hmat_type i;
+	acpi_status status;
+
+	if (srat_disabled())
+		return 0;
+
+	status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
+	if (ACPI_FAILURE(status))
+		return 0;
+
+	for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) {
+		if (acpi_table_parse_entries(ACPI_SIG_HMAT,
+					     sizeof(struct acpi_table_hmat), i,
+					     hmat_parse_subtable, 0) < 0)
+			goto out_put;
+	}
+out_put:
+	acpi_put_table(tbl);
+	return 0;
+}
+subsys_initcall(hmat_init);
-- 
2.14.4

* [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
                   ` (2 preceding siblings ...)
  2019-01-24 23:07 ` [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-02-05 12:33     ` Rafael J. Wysocki
                     ` (2 more replies)
  2019-01-24 23:07 ` [PATCHv5 05/10] acpi/hmat: Register processor domain to its memory Keith Busch
                   ` (8 subsequent siblings)
  12 siblings, 3 replies; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Systems may be constructed with various specialized nodes. Some nodes
may provide memory, some provide compute devices that access and use
that memory, and others may provide both. Nodes that provide memory are
referred to as memory targets, and nodes that can initiate memory access
are referred to as memory initiators.

Memory targets will often have varying access characteristics from
different initiators, and platforms may have ways to express those
relationships. In preparation for these systems, provide interfaces for
the kernel to export the relationships between memory targets and their
initiators as symlinks between the nodes.

If a system provides access locality for each initiator-target pair, nodes
may be grouped into ranked access classes relative to other nodes. The
new interface allows a subsystem to register relationships of varying
classes if available and desired to be exported.

A memory initiator may have multiple memory targets in the same access
class. A target's initiators in a given class indicate that those nodes'
access characteristics are the same relative to the other linked
initiator nodes. The targets within an initiator's access class, though,
do not necessarily perform the same as each other.

A memory target node may have multiple memory initiators. All linked
initiators in a target's class have the same access characteristics to
that target.

The following example shows the nodes' new sysfs hierarchy for a memory
target node 'Y' with access class 0 from initiator node 'X':

  # symlinks -v /sys/devices/system/node/nodeX/access0/
  relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

  # symlinks -v /sys/devices/system/node/nodeY/access0/
  relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX

The new attributes are added to the sysfs stable documentation.
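
As an illustration of how user space might consume these symlinks, the sketch
below collects the node numbers linked in one such directory. It is only a
sketch: the function name is hypothetical, and it assumes the "nodeN" entry
naming shown in the example above.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/*
 * Collect the node numbers of all "nodeN" entries in dir, for example
 * /sys/devices/system/node/node1/access0/initiators. Returns the number
 * of entries found (at most max are stored), or -1 if dir cannot be opened.
 */
static int list_linked_nodes(const char *dir, int *ids, int max)
{
	struct dirent *de;
	DIR *d;
	int n = 0;

	assert(dir != NULL && ids != NULL);
	d = opendir(dir);
	if (!d)
		return -1;

	while ((de = readdir(d)) != NULL) {
		int id;
		char tail;

		/* Accept exactly "node<number>"; skip everything else. */
		if (sscanf(de->d_name, "node%d%c", &id, &tail) != 1)
			continue;
		if (n < max)
			ids[n] = id;
		n++;
	}
	closedir(d);
	return n;
}
```

Calling it on a target's access0/initiators directory would yield that
target's class 0 initiators; calling it on an initiator's access0/targets
directory would yield the reverse mapping.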

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/ABI/stable/sysfs-devices-node |  25 ++++-
 drivers/base/node.c                         | 142 +++++++++++++++++++++++++++-
 include/linux/node.h                        |   7 +-
 3 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 3e90e1f3bf0a..fb843222a281 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -90,4 +90,27 @@ Date:		December 2009
 Contact:	Lee Schermerhorn <lee.schermerhorn@hp.com>
 Description:
 		The node's huge page size control/query attributes.
-		See Documentation/admin-guide/mm/hugetlbpage.rst
\ No newline at end of file
+		See Documentation/admin-guide/mm/hugetlbpage.rst
+
+What:		/sys/devices/system/node/nodeX/accessY/
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The node's relationship to other nodes for access class "Y".
+
+What:		/sys/devices/system/node/nodeX/accessY/initiators/
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The directory containing symlinks to memory initiator
+		nodes that have class "Y" access to this target node's
+		memory. CPUs and other memory initiators in nodes not in
+		the list accessing this node's memory may have different
+		performance.
+
+What:		/sys/devices/system/node/nodeX/accessY/targets/
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The directory containing symlinks to memory targets to
+		which this initiator node has class "Y" access.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd92ce3d..6f4097680580 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -17,6 +17,7 @@
 #include <linux/nodemask.h>
 #include <linux/cpu.h>
 #include <linux/device.h>
+#include <linux/pm_runtime.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
 
@@ -59,6 +60,94 @@ static inline ssize_t node_read_cpulist(struct device *dev,
 static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
+/**
+ * struct node_access_nodes - Access class device to hold user visible
+ * 			      relationships to other nodes.
+ * @dev:	Device for this memory access class
+ * @list_node:	List element in the node's access list
+ * @access:	The access class rank
+ */
+struct node_access_nodes {
+	struct device		dev;
+	struct list_head	list_node;
+	unsigned		access;
+};
+#define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev)
+
+static struct attribute *node_init_access_node_attrs[] = {
+	NULL,
+};
+
+static struct attribute *node_targ_access_node_attrs[] = {
+	NULL,
+};
+
+static const struct attribute_group initiators = {
+	.name	= "initiators",
+	.attrs	= node_init_access_node_attrs,
+};
+
+static const struct attribute_group targets = {
+	.name	= "targets",
+	.attrs	= node_targ_access_node_attrs,
+};
+
+static const struct attribute_group *node_access_node_groups[] = {
+	&initiators,
+	&targets,
+	NULL,
+};
+
+static void node_remove_accesses(struct node *node)
+{
+	struct node_access_nodes *c, *cnext;
+
+	list_for_each_entry_safe(c, cnext, &node->access_list, list_node) {
+		list_del(&c->list_node);
+		device_unregister(&c->dev);
+	}
+}
+
+static void node_access_release(struct device *dev)
+{
+	kfree(to_access_nodes(dev));
+}
+
+static struct node_access_nodes *node_init_node_access(struct node *node,
+						       unsigned access)
+{
+	struct node_access_nodes *access_node;
+	struct device *dev;
+
+	list_for_each_entry(access_node, &node->access_list, list_node)
+		if (access_node->access == access)
+			return access_node;
+
+	access_node = kzalloc(sizeof(*access_node), GFP_KERNEL);
+	if (!access_node)
+		return NULL;
+
+	access_node->access = access;
+	dev = &access_node->dev;
+	dev->parent = &node->dev;
+	dev->release = node_access_release;
+	dev->groups = node_access_node_groups;
+	if (dev_set_name(dev, "access%u", access))
+		goto free;
+
+	if (device_register(dev))
+		goto free_name;
+
+	pm_runtime_no_callbacks(dev);
+	list_add_tail(&access_node->list_node, &node->access_list);
+	return access_node;
+free_name:
+	kfree_const(dev->kobj.name);
+free:
+	kfree(access_node);
+	return NULL;
+}
+
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
@@ -340,7 +429,7 @@ static int register_node(struct node *node, int num)
 void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
-
+	node_remove_accesses(node);
 	device_unregister(&node->dev);
 }
 
@@ -372,6 +461,56 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid)
 				 kobject_name(&node_devices[nid]->dev.kobj));
 }
 
+/**
+ * register_memory_node_under_compute_node - link memory node to its compute
+ *					     node for a given access class.
+ * @mem_nid:	Memory node number
+ * @cpu_nid:	CPU node number
+ * @access:	Access class to register
+ *
+ * Description:
+ * 	For use with platforms that may have separate memory and compute nodes.
+ * 	This function will export node relationships linking which memory
+ * 	initiator nodes can access memory targets at a given ranked access
+ * 	class.
+ */
+int register_memory_node_under_compute_node(unsigned int mem_nid,
+					    unsigned int cpu_nid,
+					    unsigned access)
+{
+	struct node *init_node, *targ_node;
+	struct node_access_nodes *initiator, *target;
+	int ret;
+
+	if (!node_online(cpu_nid) || !node_online(mem_nid))
+		return -ENODEV;
+
+	init_node = node_devices[cpu_nid];
+	targ_node = node_devices[mem_nid];
+	initiator = node_init_node_access(init_node, access);
+	target = node_init_node_access(targ_node, access);
+	if (!initiator || !target)
+		return -ENOMEM;
+
+	ret = sysfs_add_link_to_group(&initiator->dev.kobj, "targets",
+				      &targ_node->dev.kobj,
+				      dev_name(&targ_node->dev));
+	if (ret)
+		return ret;
+
+	ret = sysfs_add_link_to_group(&target->dev.kobj, "initiators",
+				      &init_node->dev.kobj,
+				      dev_name(&init_node->dev));
+	if (ret)
+		goto err;
+
+	return 0;
+ err:
+	sysfs_remove_link_from_group(&initiator->dev.kobj, "targets",
+				     dev_name(&targ_node->dev));
+	return ret;
+}
+
 int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 {
 	struct device *obj;
@@ -580,6 +719,7 @@ int __register_one_node(int nid)
 			register_cpu_under_node(cpu, nid);
 	}
 
+	INIT_LIST_HEAD(&node_devices[nid]->access_list);
 	/* initialize work queue for memory hot plug */
 	init_node_hugetlb_work(nid);
 
diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..f34688a203c1 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -17,11 +17,12 @@
 
 #include <linux/device.h>
 #include <linux/cpumask.h>
+#include <linux/list.h>
 #include <linux/workqueue.h>
 
 struct node {
 	struct device	dev;
-
+	struct list_head access_list;
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
 	struct work_struct	node_work;
 #endif
@@ -75,6 +76,10 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 					   unsigned long phys_index);
 
+extern int register_memory_node_under_compute_node(unsigned int mem_nid,
+						   unsigned int cpu_nid,
+						   unsigned access);
+
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
 					 node_registration_func_t unregister);
-- 
2.14.4

* [PATCHv5 05/10] acpi/hmat: Register processor domain to its memory
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
                   ` (3 preceding siblings ...)
  2019-01-24 23:07 ` [PATCHv5 04/10] node: Link memory nodes to their compute nodes Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-02-06 12:26     ` Jonathan Cameron
  2019-01-24 23:07 ` [PATCHv5 06/10] node: Add heterogenous memory access attributes Keith Busch
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

If the HMAT Subsystem Address Range provides a valid processor proximity
domain for a memory domain, or a processor domain has access performance
identical to that of the valid processor proximity domain, register the
memory target with that initiator so this relationship will be visible
under the node's sysfs directory.

Since HMAT requires that valid address ranges have an equivalent SRAT
entry, verify that each memory target satisfies this requirement.
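
The matching rule can be pictured with the simplified user-space sketch
below. The patch itself walks each saved locality table rather than comparing
a flat record, so the struct and function here are hypothetical and only
capture the "every attribute must match" idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-pair performance record; field names follow the patch. */
struct access_perf {
	unsigned int read_bandwidth;
	unsigned int write_bandwidth;
	unsigned int read_latency;
	unsigned int write_latency;
};

/*
 * An initiator counts as local to a target only if every attribute matches
 * the values seen from the target's own processor domain.
 */
static bool perf_is_local(const struct access_perf *reference,
			  const struct access_perf *candidate)
{
	assert(reference != NULL && candidate != NULL);
	return candidate->read_bandwidth == reference->read_bandwidth &&
	       candidate->write_bandwidth == reference->write_bandwidth &&
	       candidate->read_latency == reference->read_latency &&
	       candidate->write_latency == reference->write_latency;
}
```

Requiring all four values to agree avoids treating two initiators as
equivalent when only one of several attributes happens to be the same.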

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/hmat/hmat.c | 310 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 310 insertions(+)

diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index 1741bf30d87f..85fd835c2e23 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -16,6 +16,91 @@
 #include <linux/node.h>
 #include <linux/sysfs.h>
 
+static __initdata LIST_HEAD(targets);
+static __initdata LIST_HEAD(initiators);
+static __initdata LIST_HEAD(localities);
+
+struct memory_target {
+	struct list_head node;
+	unsigned int memory_pxm;
+	unsigned int processor_pxm;
+	unsigned int read_bandwidth;
+	unsigned int write_bandwidth;
+	unsigned int read_latency;
+	unsigned int write_latency;
+};
+
+struct memory_initiator {
+	struct list_head node;
+	unsigned int processor_pxm;
+};
+
+struct memory_locality {
+	struct list_head node;
+	struct acpi_hmat_locality *hmat_loc;
+};
+
+static __init struct memory_initiator *find_mem_initiator(unsigned int cpu_pxm)
+{
+	struct memory_initiator *initiator;
+
+	list_for_each_entry(initiator, &initiators, node)
+		if (initiator->processor_pxm == cpu_pxm)
+			return initiator;
+	return NULL;
+}
+
+static __init struct memory_target *find_mem_target(unsigned int mem_pxm)
+{
+	struct memory_target *target;
+
+	list_for_each_entry(target, &targets, node)
+		if (target->memory_pxm == mem_pxm)
+			return target;
+	return NULL;
+}
+
+static __init struct memory_initiator *alloc_memory_initiator(
+							unsigned int cpu_pxm)
+{
+	struct memory_initiator *initiator;
+
+	if (pxm_to_node(cpu_pxm) == NUMA_NO_NODE)
+		return NULL;
+
+	initiator = find_mem_initiator(cpu_pxm);
+	if (initiator)
+		return initiator;
+
+	initiator = kzalloc(sizeof(*initiator), GFP_KERNEL);
+	if (!initiator)
+		return NULL;
+
+	initiator->processor_pxm = cpu_pxm;
+	list_add_tail(&initiator->node, &initiators);
+	return initiator;
+}
+
+static __init void alloc_memory_target(unsigned int mem_pxm)
+{
+	struct memory_target *target;
+
+	if (pxm_to_node(mem_pxm) == NUMA_NO_NODE)
+		return;
+
+	target = find_mem_target(mem_pxm);
+	if (target)
+		return;
+
+	target = kzalloc(sizeof(*target), GFP_KERNEL);
+	if (!target)
+		return;
+
+	target->memory_pxm = mem_pxm;
+	target->processor_pxm = PXM_INVAL;
+	list_add_tail(&target->node, &targets);
+}
+
 static __init const char *hmat_data_type(u8 type)
 {
 	switch (type) {
@@ -52,13 +137,45 @@ static __init const char *hmat_data_type_suffix(u8 type)
 	};
 }
 
+static __init void hmat_update_target_access(struct memory_target *target,
+                                             u8 type, u32 value)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+		target->read_latency = value;
+		target->write_latency = value;
+		break;
+	case ACPI_HMAT_READ_LATENCY:
+		target->read_latency = value;
+		break;
+	case ACPI_HMAT_WRITE_LATENCY:
+		target->write_latency = value;
+		break;
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+		target->read_bandwidth = value;
+		target->write_bandwidth = value;
+		break;
+	case ACPI_HMAT_READ_BANDWIDTH:
+		target->read_bandwidth = value;
+		break;
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		target->write_bandwidth = value;
+		break;
+	default:
+		break;
+	};
+}
+
 static __init int hmat_parse_locality(union acpi_subtable_headers *header,
 				      const unsigned long end)
 {
 	struct acpi_hmat_locality *hmat_loc = (void *)header;
+	struct memory_target *target;
+	struct memory_initiator *initiator;
 	unsigned int init, targ, total_size, ipds, tpds;
 	u32 *inits, *targs, value;
 	u16 *entries;
+	bool report = false;
 	u8 type;
 
 	if (hmat_loc->header.length < sizeof(*hmat_loc)) {
@@ -82,16 +199,42 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
 		hmat_loc->flags, hmat_data_type(type), ipds, tpds,
 		hmat_loc->entry_base_unit);
 
+	/* Don't report performance of memory side caches */
+	switch (hmat_loc->flags & ACPI_HMAT_MEMORY_HIERARCHY) {
+	case ACPI_HMAT_MEMORY:
+	case ACPI_HMAT_LAST_LEVEL_CACHE:
+		report = true;
+		break;
+	default:
+		break;
+	}
+
 	inits = (u32 *)(hmat_loc + 1);
 	targs = &inits[ipds];
 	entries = (u16 *)(&targs[tpds]);
 	for (init = 0; init < ipds; init++) {
+		initiator = alloc_memory_initiator(inits[init]);
 		for (targ = 0; targ < tpds; targ++) {
 			value = entries[init * tpds + targ];
 			value = (value * hmat_loc->entry_base_unit) / 10;
 			pr_info("  Initiator-Target[%d-%d]:%d%s\n",
 				inits[init], targs[targ], value,
 				hmat_data_type_suffix(type));
+
+			target = find_mem_target(targs[targ]);
+			if (target && report && initiator &&
+			    target->processor_pxm == initiator->processor_pxm)
+				hmat_update_target_access(target, type, value);
+		}
+	}
+
+	if (report) {
+		struct memory_locality *loc;
+
+		loc = kzalloc(sizeof(*loc), GFP_KERNEL);
+		if (loc) {
+			loc->hmat_loc = hmat_loc;
+			list_add_tail(&loc->node, &localities);
 		}
 	}
 
@@ -122,16 +265,35 @@ static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
 					   const unsigned long end)
 {
 	struct acpi_hmat_address_range *spa = (void *)header;
+	struct memory_target *target = NULL;
 
 	if (spa->header.length != sizeof(*spa)) {
 		pr_debug("HMAT: Unexpected address range header length: %d\n",
 			 spa->header.length);
 		return -EINVAL;
 	}
+
 	pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
 		spa->physical_address_base, spa->physical_address_length,
 		spa->flags, spa->processor_PD, spa->memory_PD);
 
+	if (spa->flags & ACPI_HMAT_MEMORY_PD_VALID) {
+		target = find_mem_target(spa->memory_PD);
+		if (!target) {
+			pr_debug("HMAT: Memory Domain missing from SRAT\n");
+			return -EINVAL;
+		}
+	}
+	if (target && spa->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
+		int p_node = pxm_to_node(spa->processor_PD);
+
+		if (p_node == NUMA_NO_NODE) {
+			pr_debug("HMAT: Invalid Processor Domain\n");
+			return -EINVAL;
+		}
+		target->processor_pxm = spa->processor_PD;
+	}
+
 	return 0;
 }
 
@@ -155,6 +317,142 @@ static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
 	}
 }
 
+static __init int srat_parse_mem_affinity(union acpi_subtable_headers *header,
+					  const unsigned long end)
+{
+	struct acpi_srat_mem_affinity *ma = (void *)header;
+
+	if (!ma)
+		return -EINVAL;
+	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
+		return 0;
+	alloc_memory_target(ma->proximity_domain);
+	return 0;
+}
+
+static __init bool hmat_is_local(struct memory_target *target,
+                                 u8 type, u32 value)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+		return value == target->read_latency &&
+		       value == target->write_latency;
+	case ACPI_HMAT_READ_LATENCY:
+		return value == target->read_latency;
+	case ACPI_HMAT_WRITE_LATENCY:
+		return value == target->write_latency;
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+		return value == target->read_bandwidth &&
+		       value == target->write_bandwidth;
+	case ACPI_HMAT_READ_BANDWIDTH:
+		return value == target->read_bandwidth;
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		return value == target->write_bandwidth;
+	default:
+		return true;
+	};
+}
+
+static __init bool hmat_is_local_initiator(struct memory_target *target,
+					   struct memory_initiator *initiator,
+					   struct acpi_hmat_locality *hmat_loc)
+{
+	unsigned int ipds, tpds, i, idx = 0, tdx = 0;
+	u32 *inits, *targs, value;
+	u16 *entries;
+
+	ipds = hmat_loc->number_of_initiator_Pds;
+	tpds = hmat_loc->number_of_target_Pds;
+	inits = (u32 *)(hmat_loc + 1);
+	targs = &inits[ipds];
+	entries = (u16 *)(&targs[tpds]);
+
+	for (i = 0; i < ipds; i++) {
+		if (inits[i] == initiator->processor_pxm) {
+			idx = i;
+			break;
+		}
+	}
+
+	if (i == ipds)
+		return false;
+
+	for (i = 0; i < tpds; i++) {
+		if (targs[i] == target->memory_pxm) {
+			tdx = i;
+			break;
+		}
+	}
+	if (i == tpds)
+		return false;
+
+	value = entries[idx * tpds + tdx];
+	value = (value * hmat_loc->entry_base_unit) / 10;
+
+	return hmat_is_local(target, hmat_loc->data_type, value);
+}
+
+static __init void hmat_register_if_local(struct memory_target *target,
+					  struct memory_initiator *initiator)
+{
+	unsigned int mem_nid, cpu_nid;
+	struct memory_locality *loc;
+
+	if (initiator->processor_pxm == target->processor_pxm)
+		return;
+
+	list_for_each_entry(loc, &localities, node)
+		if (!hmat_is_local_initiator(target, initiator, loc->hmat_loc))
+			return;
+
+	mem_nid = pxm_to_node(target->memory_pxm);
+	cpu_nid = pxm_to_node(initiator->processor_pxm);
+	register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
+}
+
+static __init void hmat_register_target_initiators(struct memory_target *target)
+{
+	struct memory_initiator *initiator;
+	unsigned int mem_nid, cpu_nid;
+
+	if (target->processor_pxm == PXM_INVAL)
+		return;
+
+	mem_nid = pxm_to_node(target->memory_pxm);
+	cpu_nid = pxm_to_node(target->processor_pxm);
+	if (register_memory_node_under_compute_node(mem_nid, cpu_nid, 0))
+		return;
+
+	if (list_empty(&localities))
+		return;
+
+	list_for_each_entry(initiator, &initiators, node)
+		hmat_register_if_local(target, initiator);
+}
+
+static __init void hmat_register_targets(void)
+{
+	struct memory_target *target, *tnext;
+	struct memory_locality *loc, *lnext;
+	struct memory_initiator *initiator, *inext;
+
+	list_for_each_entry_safe(target, tnext, &targets, node) {
+		list_del(&target->node);
+		hmat_register_target_initiators(target);
+		kfree(target);
+	}
+
+	list_for_each_entry_safe(initiator, inext, &initiators, node) {
+		list_del(&initiator->node);
+		kfree(initiator);
+	}
+
+	list_for_each_entry_safe(loc, lnext, &localities, node) {
+		list_del(&loc->node);
+		kfree(loc);
+	}
+}
+
 static __init int hmat_init(void)
 {
 	struct acpi_table_header *tbl;
@@ -164,6 +462,17 @@ static __init int hmat_init(void)
 	if (srat_disabled())
 		return 0;
 
+	status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
+	if (ACPI_FAILURE(status))
+		return 0;
+
+	if (acpi_table_parse_entries(ACPI_SIG_SRAT,
+				sizeof(struct acpi_table_srat),
+				ACPI_SRAT_TYPE_MEMORY_AFFINITY,
+				srat_parse_mem_affinity, 0) < 0)
+		goto out_put;
+	acpi_put_table(tbl);
+
 	status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
 	if (ACPI_FAILURE(status))
 		return 0;
@@ -174,6 +483,7 @@ static __init int hmat_init(void)
 					     hmat_parse_subtable, 0) < 0)
 			goto out_put;
 	}
+	hmat_register_targets();
 out_put:
 	acpi_put_table(tbl);
 	return 0;
-- 
2.14.4

* [PATCHv5 06/10] node: Add heterogenous memory access attributes
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
                   ` (4 preceding siblings ...)
  2019-01-24 23:07 ` [PATCHv5 05/10] acpi/hmat: Register processor domain to its memory Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-01-24 23:07 ` [PATCHv5 07/10] acpi/hmat: Register performance attributes Keith Busch
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Heterogeneous memory systems provide memory nodes with different latency
and bandwidth performance attributes. Provide a new kernel interface
for subsystems to register the attributes under the memory target
node's initiator access class. If the system provides this information,
applications may query these attributes when deciding which node to
request memory from.

The following example shows the new sysfs hierarchy for a node exporting
performance attributes:

  # tree -P "read*|write*" /sys/devices/system/node/nodeY/accessZ/initiators/
  /sys/devices/system/node/nodeY/accessZ/initiators/
  |-- read_bandwidth
  |-- read_latency
  |-- write_bandwidth
  `-- write_latency

The bandwidth is exported as MB/s and latency is reported in
nanoseconds. The values are taken from the platform as reported by the
manufacturer.
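
As a rough userspace illustration (not part of this patch), each attribute
file holds a single unsigned decimal value followed by a newline, so an
application could build the path and parse the text roughly as below. The
helper names hmem_attr_path() and hmem_parse_attr() are hypothetical and
exist only for this sketch.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Build the sysfs path of one performance attribute, for example
 * /sys/devices/system/node/node1/access0/initiators/read_latency.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int hmem_attr_path(char *buf, size_t len, unsigned int nid,
			  unsigned int access, const char *attr)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/node/node%u/access%u/initiators/%s",
			 nid, access, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Parse the text read from an attribute file ("<decimal>\n").
 * Returns 0 and stores the value on success, -1 on malformed input.
 */
static int hmem_parse_attr(const char *text, unsigned int *value)
{
	char *end;
	unsigned long v;

	/* Reject empty input, signs and leading whitespace up front. */
	if (*text < '0' || *text > '9')
		return -1;

	errno = 0;
	v = strtoul(text, &end, 10);
	if (errno || (*end != '\n' && *end != '\0') || v > UINT_MAX)
		return -1;

	*value = (unsigned int)v;
	return 0;
}
```

A caller would read the file named by hmem_attr_path() into a small buffer
and hand that buffer to hmem_parse_attr(); bandwidth results are in MB/s
and latency results in nanoseconds, as described above.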

Memory accesses from an initiator node that is not one of the memory's
access "Z" initiator nodes linked in the same directory may observe
different performance than reported here. When a subsystem makes use
of this interface, initiators in a different access class, or initiators
omitted from every access class, may not see the same performance as
the initiators of this access class.

Descriptions for memory access initiator performance access attributes
are added to sysfs stable documentation.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/ABI/stable/sysfs-devices-node | 28 ++++++++++++++
 drivers/base/Kconfig                        |  8 ++++
 drivers/base/node.c                         | 59 +++++++++++++++++++++++++++++
 include/linux/node.h                        | 19 ++++++++++
 4 files changed, 114 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index fb843222a281..41cb9345e1e0 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -114,3 +114,31 @@ Contact:	Keith Busch <keith.busch@intel.com>
 Description:
 		The directory containing symlinks to memory targets that
 		this initiator node has class "Y" access.
+
+What:		/sys/devices/system/node/nodeX/accessY/initiators/read_bandwidth
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This node's read bandwidth in MB/s when accessed from
+		nodes found in this access class's linked initiators.
+
+What:		/sys/devices/system/node/nodeX/accessY/initiators/read_latency
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This node's read latency in nanoseconds when accessed
+		from nodes found in this access class's linked initiators.
+
+What:		/sys/devices/system/node/nodeX/accessY/initiators/write_bandwidth
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This node's write bandwidth in MB/s when accessed from
+		nodes found in this access class's linked initiators.
+
+What:		/sys/devices/system/node/nodeX/accessY/initiators/write_latency
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This node's write latency in nanoseconds when accessed
+		from nodes found in this class's linked initiators.
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 3e63a900b330..32dc81bd7056 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE
 	  unusable. You should say N here unless you are explicitly looking to
 	  test this functionality.
 
+config HMEM_REPORTING
+	bool
+	default n
+	depends on NUMA
+	help
+	  Enable reporting for heterogeneous memory access attributes under
+	  their non-uniform memory nodes.
+
 source "drivers/base/test/Kconfig"
 
 config SYS_HYPERVISOR
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6f4097680580..2de546a040a5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -71,6 +71,9 @@ struct node_access_nodes {
 	struct device		dev;
 	struct list_head	list_node;
 	unsigned		access;
+#ifdef CONFIG_HMEM_REPORTING
+	struct node_hmem_attrs	hmem_attrs;
+#endif
 };
 #define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev)
 
@@ -148,6 +151,62 @@ static struct node_access_nodes *node_init_node_access(struct node *node,
 	return NULL;
 }
 
+#ifdef CONFIG_HMEM_REPORTING
+#define ACCESS_ATTR(name) 						   \
+static ssize_t name##_show(struct device *dev,				   \
+			   struct device_attribute *attr,		   \
+			   char *buf)					   \
+{									   \
+	return sprintf(buf, "%u\n", to_access_nodes(dev)->hmem_attrs.name); \
+}									   \
+static DEVICE_ATTR_RO(name);
+
+ACCESS_ATTR(read_bandwidth)
+ACCESS_ATTR(read_latency)
+ACCESS_ATTR(write_bandwidth)
+ACCESS_ATTR(write_latency)
+
+static struct attribute *access_attrs[] = {
+	&dev_attr_read_bandwidth.attr,
+	&dev_attr_read_latency.attr,
+	&dev_attr_write_bandwidth.attr,
+	&dev_attr_write_latency.attr,
+	NULL,
+};
+
+/**
+ * node_set_perf_attrs - Set the performance values for given access class
+ * @nid: Node identifier to be set
+ * @hmem_attrs: Heterogeneous memory performance attributes
+ * @access: The access class for the given attributes
+ */
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
+			 unsigned access)
+{
+	struct node_access_nodes *c;
+	struct node *node;
+	int i;
+
+	if (WARN_ON_ONCE(!node_online(nid)))
+		return;
+
+	node = node_devices[nid];
+	c = node_init_node_access(node, access);
+	if (!c)
+		return;
+
+	c->hmem_attrs = *hmem_attrs;
+	for (i = 0; access_attrs[i] != NULL; i++) {
+		if (sysfs_add_file_to_group(&c->dev.kobj, access_attrs[i],
+					    "initiators")) {
+			pr_info("failed to add performance attribute to node %d\n",
+				nid);
+			break;
+		}
+	}
+}
+#endif
+
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
diff --git a/include/linux/node.h b/include/linux/node.h
index f34688a203c1..2db077363d9c 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -20,6 +20,25 @@
 #include <linux/list.h>
 #include <linux/workqueue.h>
 
+#ifdef CONFIG_HMEM_REPORTING
+/**
+ * struct node_hmem_attrs - heterogeneous memory performance attributes
+ *
+ * @read_bandwidth:	Read bandwidth in MB/s
+ * @write_bandwidth:	Write bandwidth in MB/s
+ * @read_latency:	Read latency in nanoseconds
+ * @write_latency:	Write latency in nanoseconds
+ */
+struct node_hmem_attrs {
+	unsigned int read_bandwidth;
+	unsigned int write_bandwidth;
+	unsigned int read_latency;
+	unsigned int write_latency;
+};
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
+			 unsigned access);
+#endif
+
 struct node {
 	struct device	dev;
 	struct list_head access_list;
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCHv5 07/10] acpi/hmat: Register performance attributes
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
                   ` (5 preceding siblings ...)
  2019-01-24 23:07 ` [PATCHv5 06/10] node: Add heterogenous memory access attributes Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-02-06 12:24     ` Jonathan Cameron
  2019-01-24 23:07 ` [PATCHv5 08/10] node: Add memory caching attributes Keith Busch
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Register the locally attached performance access attributes with the memory's
node if HMAT provides the locality table. While HMAT does make it possible
to know performance for all possible initiator-target pairings, we export
only the local and matching pairings at this time.
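
For illustration only: the cover letter describes a "matching" initiator as
one whose performance is identical to that of the local processor domain,
rather than one that agrees on a single attribute. A minimal standalone
sketch of that comparison follows. It is not the code in hmat.c, and
struct perf_attrs and perf_attrs_match() are hypothetical names.

```c
#include <stdbool.h>

/* Local mirror of the four attributes carried by struct node_hmem_attrs. */
struct perf_attrs {
	unsigned int read_bandwidth;	/* MB/s */
	unsigned int write_bandwidth;	/* MB/s */
	unsigned int read_latency;	/* nanoseconds */
	unsigned int write_latency;	/* nanoseconds */
};

/*
 * Two initiators match only if every attribute is identical;
 * agreeing on one of the four is not enough.
 */
static bool perf_attrs_match(const struct perf_attrs *a,
			     const struct perf_attrs *b)
{
	return a->read_bandwidth == b->read_bandwidth &&
	       a->write_bandwidth == b->write_bandwidth &&
	       a->read_latency == b->read_latency &&
	       a->write_latency == b->write_latency;
}
```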

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/hmat/Kconfig |  1 +
 drivers/acpi/hmat/hmat.c  | 14 ++++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
index c9637e2e7514..08e972ead159 100644
--- a/drivers/acpi/hmat/Kconfig
+++ b/drivers/acpi/hmat/Kconfig
@@ -2,6 +2,7 @@
 config ACPI_HMAT
 	bool "ACPI Heterogeneous Memory Attribute Table Support"
 	depends on ACPI_NUMA
+	select HMEM_REPORTING
 	help
 	 If set, this option causes the kernel to set the memory NUMA node
 	 relationships and access attributes in accordance with ACPI HMAT
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index 85fd835c2e23..917e6122b3f0 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -430,6 +430,19 @@ static __init void hmat_register_target_initiators(struct memory_target *target)
 		hmat_register_if_local(target, initiator);
 }
 
+static __init void hmat_register_target_perf(struct memory_target *target)
+{
+	unsigned mem_nid = pxm_to_node(target->memory_pxm);
+	struct node_hmem_attrs hmem_attrs = {
+		.read_bandwidth	= target->read_bandwidth,
+		.write_bandwidth= target->write_bandwidth,
+		.read_latency	= target->read_latency,
+		.write_latency	= target->write_latency,
+	};
+
+	node_set_perf_attrs(mem_nid, &hmem_attrs, 0);
+}
+
 static __init void hmat_register_targets(void)
 {
 	struct memory_target *target, *tnext;
@@ -439,6 +452,7 @@ static __init void hmat_register_targets(void)
 	list_for_each_entry_safe(target, tnext, &targets, node) {
 		list_del(&target->node);
 		hmat_register_target_initiators(target);
+		hmat_register_target_perf(target);
 		kfree(target);
 	}
 
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCHv5 08/10] node: Add memory caching attributes
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
                   ` (6 preceding siblings ...)
  2019-01-24 23:07 ` [PATCHv5 07/10] acpi/hmat: Register performance attributes Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-02-06 12:24     ` Jonathan Cameron
  2019-01-24 23:07 ` [PATCHv5 09/10] acpi/hmat: Register memory side cache attributes Keith Busch
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

System memory may have side caches to help improve access speed to
frequently requested address ranges. While the system provided cache is
transparent to the software accessing these memory ranges, applications
can optimize their own access based on cache attributes.

Provide a new API for the kernel to register these memory side caches
under the memory node that provides it.

The new sysfs representation is modeled from the existing cpu cacheinfo
attributes, as seen from /sys/devices/system/cpu/<cpu>/cache/.
Unlike CPU cacheinfo though, the node cache level is reported from
the view of the memory. A higher number is nearer to the CPU, while
lower levels are closer to the backing memory. Also unlike CPU cache,
it is assumed the system will handle flushing any dirty cached memory
to the last level on a power failure if the range is persistent memory.

The attributes we export are the cache size, the line size, associativity,
and write back policy.
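
To make the level numbering concrete, here is a small userspace sketch that
picks the cache nearest to the initiator from a set of levels already read
from sysfs. It assumes only what is stated above, namely that a higher
level number is nearer to the CPU. struct side_cache and
nearest_side_cache() are hypothetical names used for this example.

```c
#include <stddef.h>

/* One memory side cache level as an application might record it. */
struct side_cache {
	unsigned int level;		/* value of the "level" attribute */
	unsigned long long size;	/* value of the "size" attribute, bytes */
};

/*
 * Return the entry with the highest level, which is the cache
 * nearest to the initiator, or NULL if the node has no side cache.
 */
static const struct side_cache *
nearest_side_cache(const struct side_cache *caches, size_t count)
{
	const struct side_cache *best = NULL;
	size_t i;

	for (i = 0; i < count; i++) {
		if (!best || caches[i].level > best->level)
			best = &caches[i];
	}
	return best;
}
```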

Add the attributes for the system memory side caches to sysfs stable
documentation.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/ABI/stable/sysfs-devices-node |  34 +++++++
 drivers/base/node.c                         | 153 ++++++++++++++++++++++++++++
 include/linux/node.h                        |  34 +++++++
 3 files changed, 221 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 41cb9345e1e0..26327279b6b6 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -142,3 +142,37 @@ Contact:	Keith Busch <keith.busch@intel.com>
 Description:
 		This node's write latency in nanoseconds when accessed
 		from nodes found in this class's linked initiators.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/associativity
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The cache's associativity: 0 for direct mapped, non-zero if
+		indexed.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/level
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This cache's level in the memory hierarchy. Matches 'Y' in the
+		directory name.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/line_size
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The number of bytes accessed from the next cache level on a
+		cache miss.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/size
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The size of this memory side cache in bytes.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/write_policy
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The cache write policy: 0 for write-back, 1 for write-through,
+		2 for other or unknown.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2de546a040a5..9b4cb29863ff 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -205,6 +205,157 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
 		}
 	}
 }
+
+/**
+ * struct node_cache_info - Internal tracking for memory node caches
+ * @dev:	Device representing the cache level
+ * @node:	List element for tracking in the node
+ * @cache_attrs:Attributes for this cache level
+ */
+struct node_cache_info {
+	struct device dev;
+	struct list_head node;
+	struct node_cache_attrs cache_attrs;
+};
+#define to_cache_info(device) container_of(device, struct node_cache_info, dev)
+
+#define CACHE_ATTR(name, fmt) 						\
+static ssize_t name##_show(struct device *dev,				\
+			   struct device_attribute *attr,		\
+			   char *buf)					\
+{									\
+	return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\
+}									\
+static DEVICE_ATTR_RO(name);
+
+CACHE_ATTR(size, "%llu")
+CACHE_ATTR(level, "%u")
+CACHE_ATTR(line_size, "%u")
+CACHE_ATTR(associativity, "%u")
+CACHE_ATTR(write_policy, "%u")
+
+static struct attribute *cache_attrs[] = {
+	&dev_attr_level.attr,
+	&dev_attr_associativity.attr,
+	&dev_attr_size.attr,
+	&dev_attr_line_size.attr,
+	&dev_attr_write_policy.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(cache);
+
+static void node_cache_release(struct device *dev)
+{
+	kfree(dev);
+}
+
+static void node_cacheinfo_release(struct device *dev)
+{
+	struct node_cache_info *info = to_cache_info(dev);
+	kfree(info);
+}
+
+static void node_init_cache_dev(struct node *node)
+{
+	struct device *dev;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return;
+
+	dev->parent = &node->dev;
+	dev->release = node_cache_release;
+	if (dev_set_name(dev, "side_cache"))
+		goto free_dev;
+
+	if (device_register(dev))
+		goto free_name;
+
+	pm_runtime_no_callbacks(dev);
+	node->cache_dev = dev;
+	return;
+free_name:
+	kfree_const(dev->kobj.name);
+free_dev:
+	kfree(dev);
+}
+
+/**
+ * node_add_cache - add cache attribute to a memory node
+ * @nid: Node identifier that has new cache attributes
+ * @cache_attrs: Attributes for the cache being added
+ */
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
+{
+	struct node_cache_info *info;
+	struct device *dev;
+	struct node *node;
+
+	if (!node_online(nid) || !node_devices[nid])
+		return;
+
+	node = node_devices[nid];
+	list_for_each_entry(info, &node->cache_attrs, node) {
+		if (info->cache_attrs.level == cache_attrs->level) {
+			dev_warn(&node->dev,
+				"attempt to add duplicate cache level:%d\n",
+				cache_attrs->level);
+			return;
+		}
+	}
+
+	if (!node->cache_dev)
+		node_init_cache_dev(node);
+	if (!node->cache_dev)
+		return;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return;
+
+	dev = &info->dev;
+	dev->parent = node->cache_dev;
+	dev->release = node_cacheinfo_release;
+	dev->groups = cache_groups;
+	if (dev_set_name(dev, "index%d", cache_attrs->level))
+		goto free_cache;
+
+	info->cache_attrs = *cache_attrs;
+	if (device_register(dev)) {
+		dev_warn(&node->dev, "failed to add cache level:%d\n",
+			 cache_attrs->level);
+		goto free_name;
+	}
+	pm_runtime_no_callbacks(dev);
+	list_add_tail(&info->node, &node->cache_attrs);
+	return;
+free_name:
+	kfree_const(dev->kobj.name);
+free_cache:
+	kfree(info);
+}
+
+static void node_remove_caches(struct node *node)
+{
+	struct node_cache_info *info, *next;
+
+	if (!node->cache_dev)
+		return;
+
+	list_for_each_entry_safe(info, next, &node->cache_attrs, node) {
+		list_del(&info->node);
+		device_unregister(&info->dev);
+	}
+	device_unregister(node->cache_dev);
+}
+
+static void node_init_caches(unsigned int nid)
+{
+	INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
+}
+#else
+static void node_init_caches(unsigned int nid) { }
+static void node_remove_caches(struct node *node) { }
 #endif
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
@@ -489,6 +640,7 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
 	node_remove_accesses(node);
+	node_remove_caches(node);
 	device_unregister(&node->dev);
 }
 
@@ -781,6 +933,7 @@ int __register_one_node(int nid)
 	INIT_LIST_HEAD(&node_devices[nid]->access_list);
 	/* initialize work queue for memory hot plug */
 	init_node_hugetlb_work(nid);
+	node_init_caches(nid);
 
 	return error;
 }
diff --git a/include/linux/node.h b/include/linux/node.h
index 2db077363d9c..842e4ab2ae6d 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -37,6 +37,36 @@ struct node_hmem_attrs {
 };
 void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
 			 unsigned access);
+
+enum cache_associativity {
+	NODE_CACHE_DIRECT_MAP,
+	NODE_CACHE_INDEXED,
+	NODE_CACHE_OTHER,
+};
+
+enum cache_write_policy {
+	NODE_CACHE_WRITE_BACK,
+	NODE_CACHE_WRITE_THROUGH,
+	NODE_CACHE_WRITE_OTHER,
+};
+
+/**
+ * struct node_cache_attrs - system memory caching attributes
+ *
+ * @associativity:	The ways memory blocks may be placed in cache
+ * @write_policy:	Write back or write through policy
+ * @size:		Total size of cache in bytes
+ * @line_size:		Number of bytes fetched on a cache miss
+ * @level:		Represents the cache hierarchy level
+ */
+struct node_cache_attrs {
+	enum cache_associativity associativity;
+	enum cache_write_policy write_policy;
+	u64 size;
+	u16 line_size;
+	u8  level;
+};
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
 #endif
 
 struct node {
@@ -45,6 +75,10 @@ struct node {
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
 	struct work_struct	node_work;
 #endif
+#ifdef CONFIG_HMEM_REPORTING
+	struct list_head cache_attrs;
+	struct device *cache_dev;
+#endif
 };
 
 struct memory_block;
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCHv5 09/10] acpi/hmat: Register memory side cache attributes
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
                   ` (7 preceding siblings ...)
  2019-01-24 23:07 ` [PATCHv5 08/10] node: Add memory caching attributes Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-02-06 12:17     ` Jonathan Cameron
  2019-01-24 23:07 ` [PATCHv5 10/10] doc/mm: New documentation for memory performance Keith Busch
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Register memory side cache attributes with the memory's node if HMAT
provides the side cache information table.
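
The field extraction done in this patch can be tried outside the kernel
with the sketch below. The bit positions follow the shifts used in the
diff (4, 8, 12 and 16); the mask values and every name here are
assumptions made for the example and are not the ACPICA definitions.

```c
#include <stdint.h>

/* Assumed layout of the 32-bit cache attributes word. */
#define EX_CACHE_LEVEL_MASK	0x000000F0u	/* bits 4-7 */
#define EX_CACHE_ASSOC_MASK	0x00000F00u	/* bits 8-11 */
#define EX_CACHE_POLICY_MASK	0x0000F000u	/* bits 12-15 */
#define EX_CACHE_LINE_MASK	0xFFFF0000u	/* bits 16-31 */

struct decoded_cache_attrs {
	uint8_t level;
	uint8_t associativity;
	uint8_t write_policy;
	uint16_t line_size;
};

/* Split the packed attributes word into its individual fields. */
static struct decoded_cache_attrs decode_cache_attrs(uint32_t attrs)
{
	struct decoded_cache_attrs d;

	d.level = (uint8_t)((attrs & EX_CACHE_LEVEL_MASK) >> 4);
	d.associativity = (uint8_t)((attrs & EX_CACHE_ASSOC_MASK) >> 8);
	d.write_policy = (uint8_t)((attrs & EX_CACHE_POLICY_MASK) >> 12);
	d.line_size = (uint16_t)((attrs & EX_CACHE_LINE_MASK) >> 16);
	return d;
}
```

The raw associativity and write policy fields would still need the
switch statements from the patch to map firmware values onto the
NODE_CACHE_* enums.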

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/hmat/hmat.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index 917e6122b3f0..11f65b38e9f9 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -245,6 +245,7 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
 				   const unsigned long end)
 {
 	struct acpi_hmat_cache *cache = (void *)header;
+	struct node_cache_attrs cache_attrs;
 	u32 attrs;
 
 	if (cache->header.length < sizeof(*cache)) {
@@ -258,6 +259,37 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
 		cache->memory_PD, cache->cache_size, attrs,
 		cache->number_of_SMBIOShandles);
 
+	cache_attrs.size = cache->cache_size;
+	cache_attrs.level = (attrs & ACPI_HMAT_CACHE_LEVEL) >> 4;
+	cache_attrs.line_size = (attrs & ACPI_HMAT_CACHE_LINE_SIZE) >> 16;
+
+	switch ((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) {
+	case ACPI_HMAT_CA_DIRECT_MAPPED:
+		cache_attrs.associativity = NODE_CACHE_DIRECT_MAP;
+		break;
+	case ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING:
+		cache_attrs.associativity = NODE_CACHE_INDEXED;
+		break;
+	case ACPI_HMAT_CA_NONE:
+	default:
+		cache_attrs.associativity = NODE_CACHE_OTHER;
+		break;
+	}
+
+	switch ((attrs & ACPI_HMAT_WRITE_POLICY) >> 12) {
+	case ACPI_HMAT_CP_WB:
+		cache_attrs.write_policy = NODE_CACHE_WRITE_BACK;
+		break;
+	case ACPI_HMAT_CP_WT:
+		cache_attrs.write_policy = NODE_CACHE_WRITE_THROUGH;
+		break;
+	case ACPI_HMAT_CP_NONE:
+	default:
+		cache_attrs.write_policy = NODE_CACHE_WRITE_OTHER;
+		break;
+	}
+
+	node_add_cache(pxm_to_node(cache->memory_PD), &cache_attrs);
 	return 0;
 }
 
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCHv5 10/10] doc/mm: New documentation for memory performance
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
                   ` (8 preceding siblings ...)
  2019-01-24 23:07 ` [PATCHv5 09/10] acpi/hmat: Register memory side cache attributes Keith Busch
@ 2019-01-24 23:07 ` Keith Busch
  2019-02-06 10:45     ` Jonathan Cameron
  2019-01-28 14:00 ` [PATCHv5 00/10] Heterogeneuos memory node attributes Michal Hocko
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 53+ messages in thread
From: Keith Busch @ 2019-01-24 23:07 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Platforms may provide system memory where some physical address ranges
perform differently than others, or are side cached by the system.

Add documentation describing a high level overview of such systems and the
performance and caching attributes the kernel provides for applications
wishing to query this information.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/admin-guide/mm/numaperf.rst | 167 ++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/numaperf.rst

diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
new file mode 100644
index 000000000000..52999336a8ed
--- /dev/null
+++ b/Documentation/admin-guide/mm/numaperf.rst
@@ -0,0 +1,167 @@
+.. _numaperf:
+
+=============
+NUMA Locality
+=============
+
+Some platforms may have multiple types of memory attached to a single
+CPU. These disparate memory ranges share some characteristics, such as
+CPU cache coherence, but may have different performance. For example,
+different media types and buses affect bandwidth and latency.
+
+A system supports such heterogeneous memory by grouping each memory
+type under different "nodes" based on similar CPU locality and performance
+characteristics.  Some memory may share the same node as a CPU, and others
+are provided as memory only nodes. While memory only nodes do not provide
+CPUs, they may still be directly accessible, or local, to one or more
+compute nodes. The following diagram shows one such example of two compute
+nodes with local memory and a memory only node for each compute node:
+
+ +------------------+     +------------------+
+ | Compute Node 0   +-----+ Compute Node 1   |
+ | Local Node0 Mem  |     | Local Node1 Mem  |
+ +--------+---------+     +--------+---------+
+          |                        |
+ +--------+---------+     +--------+---------+
+ | Slower Node2 Mem |     | Slower Node3 Mem |
+ +------------------+     +------------------+
+
+A "memory initiator" is a node containing one or more devices such as
+CPUs or separate memory I/O devices that can initiate memory requests.
+A "memory target" is a node containing one or more physical address
+ranges accessible from one or more memory initiators.
+
+When multiple memory initiators exist, they may not all have the same
+performance when accessing a given memory target. Each initiator-target
+pair may be organized into different ranked access classes to represent
+this relationship. The highest performing initiator to a given target
+is considered to be one of that target's local initiators, and given
+the highest access class, 0. Any given target may have one or more
+local initiators, and any given initiator may have multiple local
+memory targets.
+
+To aid applications matching memory targets with their initiators, the
+kernel provides symlinks to each other. The following example lists the
+relationship for the access class "0" memory initiators and targets, which is
+the set of nodes with the highest performing access relationship::
+
+	# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
+	relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
+
+	# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
+	relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
+
+================
+NUMA Performance
+================
+
+Applications may wish to consider which node they want their memory to
+be allocated from based on the node's performance characteristics. If
+the system provides these attributes, the kernel exports them under the
+node sysfs hierarchy by appending the attributes directory under the
+memory node's access class 0 initiators as follows::
+
+	/sys/devices/system/node/nodeY/access0/initiators/
+
+These attributes apply only when accessed from nodes that are linked
+under this access's initiators.
+
+The performance characteristics the kernel provides for the local initiators
+are exported as follows::
+
+	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
+	/sys/devices/system/node/nodeY/access0/initiators/
+	|-- read_bandwidth
+	|-- read_latency
+	|-- write_bandwidth
+	`-- write_latency
+
+The bandwidth attributes are provided in MB/second.
+
+The latency attributes are provided in nanoseconds.
+
+The values reported here correspond to the rated latency and bandwidth
+for the platform.
+
+==========
+NUMA Cache
+==========
+
+System memory may be constructed in a hierarchy of elements with various
+performance characteristics in order to provide large address space of
+slower performing memory side-cached by a smaller higher performing
+memory. The system physical addresses that initiators are aware of
+are provided by the last memory level in the hierarchy. The system
+meanwhile uses higher performing memory to transparently cache access
+to progressively slower levels.
+
+The term "far memory" is used to denote the last level memory in the
+hierarchy. Each increasing cache level provides higher performing
+initiator access, and the term "near memory" represents the fastest
+cache provided by the system.
+
+This numbering is different than CPU caches where the cache level (ex:
+L1, L2, L3) uses a CPU centric view where each increased level is lower
+performing. In contrast, the memory cache level is centric to the last
+level memory, so the higher numbered cache level denotes memory nearer
+to the CPU, and further from far memory.
+
+The memory side caches are not directly addressable by software. When
+software accesses a system address, the system will return it from the
+near memory cache if it is present. If it is not present, the system
+accesses the next level of memory until there is either a hit in that
+cache level, or it reaches far memory.
+
+An application does not need to know about caching attributes in order
+to use the system. Software may optionally query the memory cache
+attributes in order to maximize the performance out of such a setup.
+If the system provides a way for the kernel to discover this information,
+for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
+the kernel will append these attributes to the NUMA node memory target.
+
+When the kernel first registers a memory cache with a node, the kernel
+will create the following directory::
+
+	/sys/devices/system/node/nodeX/side_cache/
+
+If that directory is not present, the system either does not provide
+a memory side cache, or that information is not accessible to the kernel.
+
+The attributes for each level of cache are provided under its cache
+level index::
+
+	/sys/devices/system/node/nodeX/side_cache/indexA/
+	/sys/devices/system/node/nodeX/side_cache/indexB/
+	/sys/devices/system/node/nodeX/side_cache/indexC/
+
+Each cache level's directory provides its attributes. For example, the
+following shows a single cache level and the attributes available for
+software to query::
+
+	# tree /sys/devices/system/node/node0/side_cache/
+	/sys/devices/system/node/node0/side_cache/
+	|-- index1
+	|   |-- associativity
+	|   |-- level
+	|   |-- line_size
+	|   |-- size
+	|   `-- write_policy
+
+The "associativity" will be 0 if it is a direct-mapped cache, and non-zero
+for any other indexed based, multi-way associativity.
+
+The "level" is the distance from the far memory, and matches the number
+appended to its "index" directory.
+
+The "line_size" is the number of bytes accessed on a cache miss.
+
+The "size" is the number of bytes provided by this cache level.
+
+The "write_policy" will be 0 for write-back, and non-zero for
+write-through caching.
+
+========
+See Also
+========
+.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
+       Section 5.2.27
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 00/10] Heterogeneuos memory node attributes
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
                   ` (9 preceding siblings ...)
  2019-01-24 23:07 ` [PATCHv5 10/10] doc/mm: New documentation for memory performance Keith Busch
@ 2019-01-28 14:00 ` Michal Hocko
  2019-02-06 12:31   ` Jonathan Cameron
  2019-02-07  9:53   ` Jonathan Cameron
  12 siblings, 0 replies; 53+ messages in thread
From: Michal Hocko @ 2019-01-28 14:00 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams, linux-api

Is there any reason why this is not CCing linux-api (Cced now)?

On Thu 24-01-19 16:07:14, Keith Busch wrote:
> == Changes since v4 ==
> 
>   All public interfaces have kernel docs.
> 
>   Renamed "class" to "access", docs and changed logs updated
>   accordingly. (Rafael)
> 
>   The sysfs hierarchy is altered to put initiators and targets in their
>   own attribute group directories (Rafael).
> 
>   The node lists are removed. This feedback is in conflict with v1
>   feedback, but consensus wants to remove multi-value sysfs attributes,
>   which includes lists. We only have symlinks now, just like v1 provided.
> 
>   Documentation and code patches are combined such that the code
>   introducing new attributes and its documentation are in the same
>   patch. (Rafael and Dan).
> 
>   The performance attributes, bandwidth and latency, are moved into the
>   initiators directory. This should make it obvious for which node
>   access the attributes apply, which was previously ambiguous.
>   (Jonathan Cameron).
> 
>   The HMAT code selecting "local" initiators is substantially changed.
>   Only PXM's that have identical performance to the HMAT's processor PXM
>   in Address Range Structure are registered. This is to avoid considering
>   nodes identical when only one of several perf attributes are the same.
>   (Jonathan Cameron).
> 
>   Verbose variable naming. Examples include "initiator" and "target"
>   instead of "i" and "t", "mem_pxm" and "cpu_pxm" instead of "m" and
>   "p". (Rafael)
> 
>   Compile fixes for when HMEM_REPORTING is not set. This is not a user
>   selectable config option, default 'n', and will have to be selected
>   by other config options that require it (Greg KH and Rafael).
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and are in use today that describe
> the more complex memory hierarchies that can be created.
> 
> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.

Can you provide a high-level description of these new attributes and how
they are supposed to be used?

Mentioning use cases is also due, considering the amount of code this
adds.

> Keith Busch (10):
>   acpi: Create subtable parsing infrastructure
>   acpi: Add HMAT to generic parsing tables
>   acpi/hmat: Parse and report heterogeneous memory
>   node: Link memory nodes to their compute nodes
>   acpi/hmat: Register processor domain to its memory
>   node: Add heterogenous memory access attributes
>   acpi/hmat: Register performance attributes
>   node: Add memory caching attributes
>   acpi/hmat: Register memory side cache attributes
>   doc/mm: New documentation for memory performance
> 
>  Documentation/ABI/stable/sysfs-devices-node   |  87 ++++-
>  Documentation/admin-guide/mm/numaperf.rst     | 167 ++++++++
>  arch/arm64/kernel/acpi_numa.c                 |   2 +-
>  arch/arm64/kernel/smp.c                       |   4 +-
>  arch/ia64/kernel/acpi.c                       |  12 +-
>  arch/x86/kernel/acpi/boot.c                   |  36 +-
>  drivers/acpi/Kconfig                          |   1 +
>  drivers/acpi/Makefile                         |   1 +
>  drivers/acpi/hmat/Kconfig                     |   9 +
>  drivers/acpi/hmat/Makefile                    |   1 +
>  drivers/acpi/hmat/hmat.c                      | 537 ++++++++++++++++++++++++++
>  drivers/acpi/numa.c                           |  16 +-
>  drivers/acpi/scan.c                           |   4 +-
>  drivers/acpi/tables.c                         |  76 +++-
>  drivers/base/Kconfig                          |   8 +
>  drivers/base/node.c                           | 354 ++++++++++++++++-
>  drivers/irqchip/irq-gic-v2m.c                 |   2 +-
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
>  drivers/irqchip/irq-gic-v3-its.c              |   6 +-
>  drivers/irqchip/irq-gic-v3.c                  |  10 +-
>  drivers/irqchip/irq-gic.c                     |   4 +-
>  drivers/mailbox/pcc.c                         |   2 +-
>  include/linux/acpi.h                          |   6 +-
>  include/linux/node.h                          |  60 ++-
>  25 files changed, 1344 insertions(+), 65 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
> 
> -- 
> 2.14.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory
  2019-01-24 23:07 ` [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory Keith Busch
@ 2019-02-05 12:12     ` Rafael J. Wysocki
  2019-02-06 12:28     ` Jonathan Cameron
  1 sibling, 0 replies; 53+ messages in thread
From: Rafael J. Wysocki @ 2019-02-05 12:12 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Fri, Jan 25, 2019 at 12:08 AM Keith Busch <keith.busch@intel.com> wrote:
>
> Systems may provide different memory types and export this information
> in the ACPI Heterogeneous Memory Attribute Table (HMAT). Parse these
> tables provided by the platform and report the memory access and caching
> attributes to the kernel messages.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/acpi/Kconfig       |   1 +
>  drivers/acpi/Makefile      |   1 +
>  drivers/acpi/hmat/Kconfig  |   8 ++
>  drivers/acpi/hmat/Makefile |   1 +
>  drivers/acpi/hmat/hmat.c   | 181 +++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 192 insertions(+)
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
>
> diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> index 90ff0a47c12e..b377f970adfd 100644
> --- a/drivers/acpi/Kconfig
> +++ b/drivers/acpi/Kconfig
> @@ -465,6 +465,7 @@ config ACPI_REDUCED_HARDWARE_ONLY
>           If you are unsure what to do, do not enable this option.
>
>  source "drivers/acpi/nfit/Kconfig"
> +source "drivers/acpi/hmat/Kconfig"
>
>  source "drivers/acpi/apei/Kconfig"
>  source "drivers/acpi/dptf/Kconfig"
> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> index bb857421c2e8..5d361e4e3405 100644
> --- a/drivers/acpi/Makefile
> +++ b/drivers/acpi/Makefile
> @@ -80,6 +80,7 @@ obj-$(CONFIG_ACPI_PROCESSOR)  += processor.o
>  obj-$(CONFIG_ACPI)             += container.o
>  obj-$(CONFIG_ACPI_THERMAL)     += thermal.o
>  obj-$(CONFIG_ACPI_NFIT)                += nfit/
> +obj-$(CONFIG_ACPI_HMAT)                += hmat/
>  obj-$(CONFIG_ACPI)             += acpi_memhotplug.o
>  obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
>  obj-$(CONFIG_ACPI_BATTERY)     += battery.o
> diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
> new file mode 100644
> index 000000000000..c9637e2e7514
> --- /dev/null
> +++ b/drivers/acpi/hmat/Kconfig
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: GPL-2.0
> +config ACPI_HMAT
> +       bool "ACPI Heterogeneous Memory Attribute Table Support"
> +       depends on ACPI_NUMA
> +       help
> +        If set, this option causes the kernel to set the memory NUMA node
> +        relationships and access attributes in accordance with ACPI HMAT
> +        (Heterogeneous Memory Attributes Table).
> diff --git a/drivers/acpi/hmat/Makefile b/drivers/acpi/hmat/Makefile
> new file mode 100644
> index 000000000000..e909051d3d00
> --- /dev/null
> +++ b/drivers/acpi/hmat/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_ACPI_HMAT) := hmat.o
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> new file mode 100644
> index 000000000000..1741bf30d87f
> --- /dev/null
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -0,0 +1,181 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2019, Intel Corporation.
> + *
> + * Heterogeneous Memory Attributes Table (HMAT) representation
> + *
> + * This program parses and reports the platform's HMAT tables, and registers
> + * the applicable attributes with the node's interfaces.
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/bitops.h>
> +#include <linux/device.h>
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/node.h>
> +#include <linux/sysfs.h>
> +
> +static __init const char *hmat_data_type(u8 type)
> +{
> +       switch (type) {
> +       case ACPI_HMAT_ACCESS_LATENCY:
> +               return "Access Latency";
> +       case ACPI_HMAT_READ_LATENCY:
> +               return "Read Latency";
> +       case ACPI_HMAT_WRITE_LATENCY:
> +               return "Write Latency";
> +       case ACPI_HMAT_ACCESS_BANDWIDTH:
> +               return "Access Bandwidth";
> +       case ACPI_HMAT_READ_BANDWIDTH:
> +               return "Read Bandwidth";
> +       case ACPI_HMAT_WRITE_BANDWIDTH:
> +               return "Write Bandwidth";
> +       default:
> +               return "Reserved";
> +       };
> +}
> +
> +static __init const char *hmat_data_type_suffix(u8 type)
> +{
> +       switch (type) {
> +       case ACPI_HMAT_ACCESS_LATENCY:
> +       case ACPI_HMAT_READ_LATENCY:
> +       case ACPI_HMAT_WRITE_LATENCY:
> +               return " nsec";
> +       case ACPI_HMAT_ACCESS_BANDWIDTH:
> +       case ACPI_HMAT_READ_BANDWIDTH:
> +       case ACPI_HMAT_WRITE_BANDWIDTH:
> +               return " MB/s";
> +       default:
> +               return "";
> +       };
> +}
> +
> +static __init int hmat_parse_locality(union acpi_subtable_headers *header,
> +                                     const unsigned long end)
> +{
> +       struct acpi_hmat_locality *hmat_loc = (void *)header;
> +       unsigned int init, targ, total_size, ipds, tpds;
> +       u32 *inits, *targs, value;
> +       u16 *entries;
> +       u8 type;
> +
> +       if (hmat_loc->header.length < sizeof(*hmat_loc)) {
> +               pr_debug("HMAT: Unexpected locality header length: %d\n",
> +                        hmat_loc->header.length);
> +               return -EINVAL;
> +       }
> +
> +       type = hmat_loc->data_type;
> +       ipds = hmat_loc->number_of_initiator_Pds;
> +       tpds = hmat_loc->number_of_target_Pds;
> +       total_size = sizeof(*hmat_loc) + sizeof(*entries) * ipds * tpds +
> +                    sizeof(*inits) * ipds + sizeof(*targs) * tpds;
> +       if (hmat_loc->header.length < total_size) {
> +               pr_debug("HMAT: Unexpected locality header length:%d, minimum required:%d\n",
> +                        hmat_loc->header.length, total_size);
> +               return -EINVAL;
> +       }
> +
> +       pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n",
> +               hmat_loc->flags, hmat_data_type(type), ipds, tpds,
> +               hmat_loc->entry_base_unit);
> +
> +       inits = (u32 *)(hmat_loc + 1);
> +       targs = &inits[ipds];
> +       entries = (u16 *)(&targs[tpds]);
> +       for (init = 0; init < ipds; init++) {
> +               for (targ = 0; targ < tpds; targ++) {
> +                       value = entries[init * tpds + targ];
> +                       value = (value * hmat_loc->entry_base_unit) / 10;
> +                       pr_info("  Initiator-Target[%d-%d]:%d%s\n",
> +                               inits[init], targs[targ], value,
> +                               hmat_data_type_suffix(type));
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static __init int hmat_parse_cache(union acpi_subtable_headers *header,
> +                                  const unsigned long end)
> +{
> +       struct acpi_hmat_cache *cache = (void *)header;
> +       u32 attrs;
> +
> +       if (cache->header.length < sizeof(*cache)) {
> +               pr_debug("HMAT: Unexpected cache header length: %d\n",
> +                        cache->header.length);
> +               return -EINVAL;
> +       }
> +
> +       attrs = cache->cache_attributes;
> +       pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n",
> +               cache->memory_PD, cache->cache_size, attrs,
> +               cache->number_of_SMBIOShandles);
> +
> +       return 0;
> +}
> +
> +static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
> +                                          const unsigned long end)
> +{
> +       struct acpi_hmat_address_range *spa = (void *)header;
> +
> +       if (spa->header.length != sizeof(*spa)) {
> +               pr_debug("HMAT: Unexpected address range header length: %d\n",
> +                        spa->header.length);
> +               return -EINVAL;
> +       }
> +       pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
> +               spa->physical_address_base, spa->physical_address_length,
> +               spa->flags, spa->processor_PD, spa->memory_PD);
> +
> +       return 0;
> +}
> +
> +static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
> +                                     const unsigned long end)
> +{
> +       struct acpi_hmat_structure *hdr = (void *)header;
> +
> +       if (!hdr)
> +               return -EINVAL;
> +
> +       switch (hdr->type) {
> +       case ACPI_HMAT_TYPE_ADDRESS_RANGE:
> +               return hmat_parse_address_range(header, end);
> +       case ACPI_HMAT_TYPE_LOCALITY:
> +               return hmat_parse_locality(header, end);
> +       case ACPI_HMAT_TYPE_CACHE:
> +               return hmat_parse_cache(header, end);
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
> +static __init int hmat_init(void)
> +{
> +       struct acpi_table_header *tbl;
> +       enum acpi_hmat_type i;
> +       acpi_status status;
> +
> +       if (srat_disabled())
> +               return 0;
> +
> +       status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
> +       if (ACPI_FAILURE(status))
> +               return 0;
> +
> +       for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) {
> +               if (acpi_table_parse_entries(ACPI_SIG_HMAT,
> +                                            sizeof(struct acpi_table_hmat), i,
> +                                            hmat_parse_subtable, 0) < 0)
> +                       goto out_put;
> +       }
> +out_put:
> +       acpi_put_table(tbl);
> +       return 0;
> +}
> +subsys_initcall(hmat_init);
> --
> 2.14.4
>
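For readers following the layout arithmetic in hmat_parse_locality()
above: the locality structure is a fixed header, then the initiator
domain list, then the target domain list, then an initiators-by-targets
matrix of 16-bit entries. Below is a minimal userspace sketch of that
size check and entry lookup. The struct loc_hdr, its field names, and
the helpers loc_min_size() and loc_entry() are hypothetical and exist
only for illustration; they are not the ACPICA definitions and this is
not the kernel's code. The divide by 10 simply mirrors the quoted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the fixed part of the locality structure. */
struct loc_hdr {
	uint16_t type;
	uint16_t reserved;
	uint32_t length;	/* total bytes, header included */
	uint8_t  flags;
	uint8_t  data_type;
	uint16_t reserved1;
	uint32_t n_init;	/* number of initiator domains */
	uint32_t n_targ;	/* number of target domains */
	uint32_t reserved2;
	uint64_t base_unit;	/* scale applied to every matrix entry */
};

/*
 * Smallest length that can hold the header, both domain lists and the
 * entry matrix. Sketch only: no hardening against hostile counts that
 * would overflow the multiplication.
 */
static size_t loc_min_size(uint32_t n_init, uint32_t n_targ)
{
	return sizeof(struct loc_hdr)
	     + sizeof(uint32_t) * n_init
	     + sizeof(uint32_t) * n_targ
	     + sizeof(uint16_t) * (size_t)n_init * n_targ;
}

/*
 * Look up the scaled entry for (init, targ), both indices into the
 * domain lists. Returns 0 on success, -1 if the buffer is too short
 * or an index is out of range.
 */
static int loc_entry(const uint8_t *buf, size_t len,
		     uint32_t init, uint32_t targ, uint64_t *out)
{
	struct loc_hdr hdr;
	uint16_t raw;
	size_t off;

	assert(buf != NULL && out != NULL);
	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));	/* avoid unaligned access */
	if (hdr.length < loc_min_size(hdr.n_init, hdr.n_targ) ||
	    len < hdr.length)
		return -1;
	if (init >= hdr.n_init || targ >= hdr.n_targ)
		return -1;

	/* The matrix starts after both u32 lists and is row-major by initiator. */
	off = sizeof(hdr)
	    + sizeof(uint32_t) * ((size_t)hdr.n_init + hdr.n_targ)
	    + sizeof(uint16_t) * ((size_t)init * hdr.n_targ + targ);
	memcpy(&raw, buf + off, sizeof(raw));
	*out = ((uint64_t)raw * hdr.base_unit) / 10;
	return 0;
}
```

Unlike the quoted code, the sketch keeps the scaled value in 64 bits
rather than assigning it back to a u32, so a large base unit does not
truncate the result. Whether that matters in practice depends on the
values platforms actually report.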

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-01-24 23:07 ` [PATCHv5 04/10] node: Link memory nodes to their compute nodes Keith Busch
@ 2019-02-05 12:33     ` Rafael J. Wysocki
  2019-02-06 12:26     ` Jonathan Cameron
  2019-02-07 11:35     ` Rafael J. Wysocki
  2 siblings, 0 replies; 53+ messages in thread
From: Rafael J. Wysocki @ 2019-02-05 12:33 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Fri, Jan 25, 2019 at 12:08 AM Keith Busch <keith.busch@intel.com> wrote:
>
> Systems may be constructed with various specialized nodes. Some nodes
> may provide memory, some provide compute devices that access and use
> that memory, and others may provide both. Nodes that provide memory are
> referred to as memory targets, and nodes that can initiate memory access
> are referred to as memory initiators.
>
> Memory targets will often have varying access characteristics from
> different initiators, and platforms may have ways to express those
> relationships. In preparation for these systems, provide interfaces for
> the kernel to export the relationships between memory targets and their
> initiators as symlinks between the nodes.
>
> If a system provides access locality for each initiator-target pair, nodes
> may be grouped into ranked access classes relative to other nodes. The
> new interface allows a subsystem to register relationships of varying
> classes if available and desired to be exported.
>
> A memory initiator may have multiple memory targets in the same access
> class. The target memory's initiators in a given class indicate that the
> nodes' access characteristics share the same performance relative to other
> linked initiator nodes. Each target within an initiator's access class,
> though, does not necessarily perform the same as the others.
>
> A memory target node may have multiple memory initiators. All linked
> initiators in a target's class have the same access characteristics to
> that target.
>
> The following example shows the nodes' new sysfs hierarchy for a memory
> target node 'Y' with access class 0 from initiator node 'X':
>
>   # symlinks -v /sys/devices/system/node/nodeX/access0/
>   relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
>
>   # symlinks -v /sys/devices/system/node/nodeY/access0/
>   relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
>
> The new attributes are added to the sysfs stable documentation.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/ABI/stable/sysfs-devices-node |  25 ++++-
>  drivers/base/node.c                         | 142 +++++++++++++++++++++++++++-
>  include/linux/node.h                        |   7 +-
>  3 files changed, 171 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 3e90e1f3bf0a..fb843222a281 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -90,4 +90,27 @@ Date:                December 2009
>  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
>  Description:
>                 The node's huge page size control/query attributes.
> -               See Documentation/admin-guide/mm/hugetlbpage.rst
> \ No newline at end of file
> +               See Documentation/admin-guide/mm/hugetlbpage.rst
> +
> +What:          /sys/devices/system/node/nodeX/accessY/
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The node's relationship to other nodes for access class "Y".
> +
> +What:          /sys/devices/system/node/nodeX/accessY/initiators/
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The directory containing symlinks to memory initiator
> +               nodes that have class "Y" access to this target node's
> +               memory. CPUs and other memory initiators in nodes not in
> +               the list accessing this node's memory may have different
> +               performance.
> +
> +What:          /sys/devices/system/node/nodeX/accessY/targets/
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The directory containing symlinks to memory targets that
> +               this initiator node has class "Y" access.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 86d6cd92ce3d..6f4097680580 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -17,6 +17,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/cpu.h>
>  #include <linux/device.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/swap.h>
>  #include <linux/slab.h>
>
> @@ -59,6 +60,94 @@ static inline ssize_t node_read_cpulist(struct device *dev,
>  static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>
> +/**
> + * struct node_access_nodes - Access class device to hold user visible
> + *                           relationships to other nodes.
> + * @dev:       Device for this memory access class
> + * @list_node: List element in the node's access list
> + * @access:    The access class rank
> + */
> +struct node_access_nodes {
> +       struct device           dev;

I'm not sure if the entire struct device is needed here.

It looks like what you need is the kobject part of it only and you can
use a kobject directly here:

struct kobject        kobj;

Then, you can register that under the node's kobject using
kobject_init_and_add() and you can create attr groups under a kobject
using sysfs_create_groups(), which is exactly what device_add_groups()
does.

That would allow you to avoid allocating extra memory to hold the
entire device structure and the extra empty "power" subdirectory added
by device registration would not be there.

> +       struct list_head        list_node;
> +       unsigned                access;
> +};

Apart from the above, the patch looks good to me.
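
As a rough illustration of how userspace might consume the
accessY/targets and accessY/initiators symlinks described in the quoted
changelog, here is a minimal sketch that collects the node numbers
linked under one of those directories. The helper names
node_id_from_name() and list_linked_nodes() are hypothetical, and the
sketch assumes only that each link is named "node<N>", as in the
changelog's example. Other entries, such as "power", are skipped.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Parse "node<N>" into N; returns -1 for any other name. */
static int node_id_from_name(const char *name)
{
	char *end;
	long id;

	if (strncmp(name, "node", 4) != 0 ||
	    !isdigit((unsigned char)name[4]))
		return -1;
	errno = 0;
	id = strtol(name + 4, &end, 10);
	if (errno != 0 || *end != '\0' || id > INT_MAX)
		return -1;
	return (int)id;
}

/*
 * Collect the node numbers linked under dir_path, for example
 * "/sys/devices/system/node/node0/access0/targets". Returns the
 * number of ids stored, or -1 if the directory cannot be opened.
 */
static int list_linked_nodes(const char *dir_path, int *ids, int max_ids)
{
	struct dirent *ent;
	int count = 0;
	DIR *dir;

	assert(ids != NULL && max_ids >= 0);
	dir = opendir(dir_path);
	if (!dir)
		return -1;
	while (count < max_ids && (ent = readdir(dir)) != NULL) {
		int id = node_id_from_name(ent->d_name);

		if (id >= 0)
			ids[count++] = id;
	}
	closedir(dir);
	return count;
}
```

A return of -1 is the expected result on kernels or platforms that do
not register any access class for the node, so callers should treat it
as "no information" rather than as an error.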

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-02-05 12:33     ` Rafael J. Wysocki
  (?)
@ 2019-02-05 14:48     ` Keith Busch
  -1 siblings, 0 replies; 53+ messages in thread
From: Keith Busch @ 2019-02-05 14:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Hansen, Dave,
	Williams, Dan J

On Tue, Feb 05, 2019 at 04:33:27AM -0800, Rafael J. Wysocki wrote:
> On Fri, Jan 25, 2019 at 12:08 AM Keith Busch <keith.busch@intel.com> wrote:
> > +/**
> > + * struct node_access_nodes - Access class device to hold user visible
> > + *                           relationships to other nodes.
> > + * @dev:       Device for this memory access class
> > + * @list_node: List element in the node's access list
> > + * @access:    The access class rank
> > + */
> > +struct node_access_nodes {
> > +       struct device           dev;
> 
> I'm not sure if the entire struct device is needed here.
> 
> It looks like what you need is the kobject part of it only and you can
> use a kobject directly here:
> 
> struct kobject        kobj;
> 
> Then, you can register that under the node's kobject using
> kobject_init_and_add() and you can create attr groups under a kobject
> using sysfs_create_groups(), which is exactly what device_add_groups()
> does.
> 
> That would allow you to avoid allocating extra memory to hold the
> entire device structure and the extra empty "power" subdirectory added
> by device registration would not be there.

This is conflicting with Greg's feedback from the first version of
this series:

  https://lore.kernel.org/lkml/20181126190619.GA32595@kroah.com/

Do you still recommend using kobject?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-02-05 12:33     ` Rafael J. Wysocki
  (?)
  (?)
@ 2019-02-05 14:52     ` Greg Kroah-Hartman
  2019-02-05 15:17         ` Rafael J. Wysocki
  -1 siblings, 1 reply; 53+ messages in thread
From: Greg Kroah-Hartman @ 2019-02-05 14:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Keith Busch, Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Dave Hansen, Dan Williams

On Tue, Feb 05, 2019 at 01:33:27PM +0100, Rafael J. Wysocki wrote:
> > +/**
> > + * struct node_access_nodes - Access class device to hold user visible
> > + *                           relationships to other nodes.
> > + * @dev:       Device for this memory access class
> > + * @list_node: List element in the node's access list
> > + * @access:    The access class rank
> > + */
> > +struct node_access_nodes {
> > +       struct device           dev;
> 
> I'm not sure if the entire struct device is needed here.
> 
> It looks like what you need is the kobject part of it only and you can
> use a kobject directly here:
> 
> struct kobject        kobj;
> 
> Then, you can register that under the node's kobject using
> kobject_init_and_add() and you can create attr groups under a kobject
> using sysfs_create_groups(), which is exactly what device_add_groups()
> does.
> 
> That would allow you to avoid allocating extra memory to hold the
> entire device structure and the extra empty "power" subdirectory added
> by device registration would not be there.

When you use a "raw" kobject then userspace tools do not see the devices
and attributes in libraries like udev.  So unless userspace does not
care about this at all, you should use a 'struct device' wherever
possible.  The memory "savings" usually just isn't worth it unless you
have a _lot_ of objects being created here.

Who is going to use all of this new information?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-02-05 14:52     ` Greg Kroah-Hartman
@ 2019-02-05 15:17         ` Rafael J. Wysocki
  0 siblings, 0 replies; 53+ messages in thread
From: Rafael J. Wysocki @ 2019-02-05 15:17 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Rafael J. Wysocki, Keith Busch, Linux Kernel Mailing List,
	ACPI Devel Maling List, Linux Memory Management List,
	Dave Hansen, Dan Williams

On Tue, Feb 5, 2019 at 3:52 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Feb 05, 2019 at 01:33:27PM +0100, Rafael J. Wysocki wrote:
> > > +/**
> > > + * struct node_access_nodes - Access class device to hold user visible
> > > + *                           relationships to other nodes.
> > > + * @dev:       Device for this memory access class
> > > + * @list_node: List element in the node's access list
> > > + * @access:    The access class rank
> > > + */
> > > +struct node_access_nodes {
> > > +       struct device           dev;
> >
> > I'm not sure if the entire struct device is needed here.
> >
> > It looks like what you need is the kobject part of it only and you can
> > use a kobject directly here:
> >
> > struct kobject        kobj;
> >
> > Then, you can register that under the node's kobject using
> > kobject_init_and_add() and you can create attr groups under a kobject
> > using sysfs_create_groups(), which is exactly what device_add_groups()
> > does.
> >
> > That would allow you to avoid allocating extra memory to hold the
> > entire device structure and the extra empty "power" subdirectory added
> > by device registration would not be there.
>
> When you use a "raw" kobject then userspace tools do not see the devices
> and attributes in libraries like udev.

And why would they need it in this particular case?

> So unless userspace does not care about this at all,

Which I think is the case here, isn't it?

> you should use a 'struct device' wherever
> possible.  The memory "savings" usually just isn't worth it unless you
> have a _lot_ of objects being created here.
>
> Who is going to use all of this new information?

Somebody who wants to know how the memory in the system is laid out AFAICS.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 10/10] doc/mm: New documentation for memory performance
  2019-01-24 23:07 ` [PATCHv5 10/10] doc/mm: New documentation for memory performance Keith Busch
@ 2019-02-06 10:45     ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 10:45 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams, linuxarm

On Thu, 24 Jan 2019 16:07:24 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Platforms may provide system memory where some physical address ranges
> perform differently than others, or are side cached by the system.
> 
> Add documentation describing a high level overview of such systems and the
> performance and caching attributes the kernel provides for applications
> wishing to query this information.
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
Hi Keith,

Nice doc in general. Comments inline.

> ---
>  Documentation/admin-guide/mm/numaperf.rst | 167 ++++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
> 
> diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
> new file mode 100644
> index 000000000000..52999336a8ed
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/numaperf.rst
> @@ -0,0 +1,167 @@
> +.. _numaperf:
> +
> +=============
> +NUMA Locality
> +=============
> +
> +Some platforms may have multiple types of memory attached to a single
> +CPU. These disparate memory ranges share some characteristics, such as
> +CPU cache coherence, but may have different performance. For example,
> +different media types and buses affect bandwidth and latency.

This seems a bit restrictive, but it gives a starting point.
I guess anyone who has a more complex system should look elsewhere for
how this maps to it!

> +
> +A system supports such heterogeneous memory by grouping each memory
> +type under different "nodes" based on similar CPU locality and performance
> +characteristics.  Some memory may share the same node as a CPU, and others
> +are provided as memory only nodes. While memory only nodes do not provide
> +CPUs, they may still be directly accessible, or local, to one or more
> +compute nodes.

Perhaps define directly accessible?  I'm not keen on saying that they don't
involve an interconnect as that rules out things like CCIX with remote
memory homes.  The reality is this patch set works fine for that case.

The one or more compute nodes can only happen (I think) with a very weird
setup of an interconnect involved which is likely to have other data on it.

> +The following diagram shows one such example of two compute
> +nodes with local memory and a memory only node for each compute node:
> +
> + +------------------+     +------------------+
> + | Compute Node 0   +-----+ Compute Node 1   |
> + | Local Node0 Mem  |     | Local Node1 Mem  |
> + +--------+---------+     +--------+---------+
> +          |                        |
> + +--------+---------+     +--------+---------+
> + | Slower Node2 Mem |     | Slower Node3 Mem |
> + +------------------+     +------------------+
> +
> +A "memory initiator" is a node containing one or more devices such as
> +CPUs or separate memory I/O devices that can initiate memory requests.
> +A "memory target" is a node containing one or more physical address
> +ranges accessible from one or more memory initiators.
> +
> +When multiple memory initiators exist, they may not all have the same
> +performance when accessing a given memory target. Each initiator-target
> +pair may be organized into different ranked access classes to represent
> +this relationship. The highest performing initiator to a given target
> +is considered to be one of that target's local initiators, and given
> +the highest access class, 0. Any given target may have one or more
> +local initiators, and any given initiator may have multiple local
> +memory targets.
> +
> +To aid applications matching memory targets with their initiators, the
> +kernel provides symlinks to each other. The following example lists the
> +relationship for the access class "0" memory initiators and targets, which is
> +the set of nodes with the highest performing access relationship::
> +
> +	# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
> +	relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
> +
> +	# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
> +	relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
> +
> +================
> +NUMA Performance
> +================
> +
> +Applications may wish to consider which node they want their memory to
> +be allocated from based on the node's performance characteristics. If
> +the system provides these attributes, the kernel exports them under the
> +node sysfs hierarchy by appending the attributes directory under the
> +memory node's access class 0 initiators as follows::
> +
> +	/sys/devices/system/node/nodeY/access0/initiators/
> +
> +These attributes apply only when accessed from nodes that are linked
> +under this access's initiators.
> +
> +The performance characteristics the kernel provides for the local initiators
> +are exported as follows::
> +
> +	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/
> +	/sys/devices/system/node/nodeY/access0/
> +	|-- read_bandwidth
> +	|-- read_latency
> +	|-- write_bandwidth
> +	`-- write_latency

These seem to be under
/sys/devices/system/node/nodeY/access0/initiators/
(so one directory deeper).

> +
> +The bandwidth attributes are provided in MiB/second.
> +
> +The latency attributes are provided in nanoseconds.
> +
> +The values reported here correspond to the rated latency and bandwidth
> +for the platform.
> +
> +==========
> +NUMA Cache
> +==========
> +
> +System memory may be constructed in a hierarchy of elements with various
> +performance characteristics in order to provide a large address space of
> +slower performing memory side-cached by a smaller higher performing
> +memory. The system physical addresses that initiators are aware of
> +are provided by the last memory level in the hierarchy. The system
> +meanwhile uses higher performing memory to transparently cache access
> +to progressively slower levels.
> +
> +The term "far memory" is used to denote the last level memory in the
> +hierarchy. Each increasing cache level provides higher performing
> +initiator access, and the term "near memory" represents the fastest
> +cache provided by the system.
> +
> +This numbering is different than CPU caches where the cache level (ex:
> +L1, L2, L3) uses a CPU centric view where each increased level is lower
> +performing. In contrast, the memory cache level is centric to the last
> +level memory, so the higher numbered cache level denotes memory nearer
> +to the CPU, and further from far memory.
> +
> +The memory side caches are not directly addressable by software. When
> +software accesses a system address, the system will return it from the
> +near memory cache if it is present. If it is not present, the system
> +accesses the next level of memory until there is either a hit in that
> +cache level, or it reaches far memory.
> +
> +An application does not need to know about caching attributes in order
> +to use the system. Software may optionally query the memory cache
> +attributes in order to maximize the performance out of such a setup.
> +If the system provides a way for the kernel to discover this information,
> +for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
> +the kernel will append these attributes to the NUMA node memory target.
> +
> +When the kernel first registers a memory cache with a node, the kernel
> +will create the following directory::
> +
> +	/sys/devices/system/node/nodeX/side_cache/
> +
> +If that directory is not present, the system either does not provide
> +a memory side cache, or that information is not accessible to the kernel.
> +
> +The attributes for each level of cache are provided under its cache
> +level index::
> +
> +	/sys/devices/system/node/nodeX/side_cache/indexA/
> +	/sys/devices/system/node/nodeX/side_cache/indexB/
> +	/sys/devices/system/node/nodeX/side_cache/indexC/
> +
> +Each cache level's directory provides its attributes. For example, the
> +following shows a single cache level and the attributes available for
> +software to query::
> +
> +	# tree /sys/devices/system/node/node0/side_cache/
> +	/sys/devices/system/node/node0/side_cache/
> +	|-- index1
> +	|   |-- associativity
> +	|   |-- level

What is the purpose of having level in here?  Isn't it the same as the A..C
in the index naming?

> +	|   |-- line_size
> +	|   |-- size
> +	|   `-- write_policy
> +
> +The "associativity" will be 0 if it is a direct-mapped cache, and non-zero
> +for any other index-based, multi-way associativity.

Is it worth providing the ACPI mapping in this doc?  We have None, Direct and
'complex'.   Fun question of what None means?  Not specified?
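
For reference, a sketch of how the HMAT cache attributes word breaks
down, going by the Memory Side Cache Information Structure in ACPI 6.2
section 5.2.27. The macro, struct and function names are local to this
example (the kernel has its own ACPI_HMAT_* definitions), so treat the
layout as an assumption to check against the spec:

```c
#include <stdint.h>

/* Field masks for the 32-bit cache attributes word (names local to this sketch). */
#define HMAT_TOTAL_CACHE_LEVELS	0x0000000Fu	/* bits 3:0 */
#define HMAT_CACHE_LEVEL	0x000000F0u	/* bits 7:4 */
#define HMAT_CACHE_ASSOC	0x00000F00u	/* bits 11:8 */
#define HMAT_WRITE_POLICY	0x0000F000u	/* bits 15:12 */
#define HMAT_CACHE_LINE_SIZE	0xFFFF0000u	/* bits 31:16, in bytes */

struct hmat_cache_fields {
	unsigned int total_levels;
	unsigned int level;
	unsigned int associativity;	/* 0 none, 1 direct mapped, 2 complex */
	unsigned int write_policy;	/* 0 none, 1 write back, 2 write through */
	unsigned int line_size;
};

/* Split the raw attributes word into its individual fields. */
struct hmat_cache_fields hmat_decode_cache_attrs(uint32_t attrs)
{
	struct hmat_cache_fields f;

	f.total_levels = attrs & HMAT_TOTAL_CACHE_LEVELS;
	f.level = (attrs & HMAT_CACHE_LEVEL) >> 4;
	f.associativity = (attrs & HMAT_CACHE_ASSOC) >> 8;
	f.write_policy = (attrs & HMAT_WRITE_POLICY) >> 12;
	f.line_size = (attrs & HMAT_CACHE_LINE_SIZE) >> 16;

	return f;
}

/* Human-readable name for the associativity encoding. */
const char *hmat_assoc_name(unsigned int assoc)
{
	switch (assoc) {
	case 0:
		return "none";
	case 1:
		return "direct mapped";
	case 2:
		return "complex cache indexing";
	default:
		return "reserved";
	}
}
```

Note the table encodes direct mapped as 1 while the proposed sysfs
attribute reports 0 for it, which is one reason spelling out the mapping
in the doc might help.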

> +
> +The "level" is the distance from the far memory, and matches the number
> +appended to its "index" directory.
> +
> +The "line_size" is the number of bytes accessed on a cache miss.

Maybe "number of bytes accessed from next cache level" ?

> +
> +The "size" is the number of bytes provided by this cache level.
> +
> +The "write_policy" will be 0 for write-back, and non-zero for
> +write-through caching.
> +
> +========
> +See Also
> +========
> +.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> +       Section 5.2.27

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 10/10] doc/mm: New documentation for memory performance
@ 2019-02-06 10:45     ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 10:45 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams, linuxarm

On Thu, 24 Jan 2019 16:07:24 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Platforms may provide system memory where some physical address ranges
> perform differently than others, or is side cached by the system.
> 
> Add documentation describing a high level overview of such systems and the
> perforamnce and caching attributes the kernel provides for applications
> wishing to query this information.
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
Hi Keith,

Nice doc in general. Comments inline.

> ---
>  Documentation/admin-guide/mm/numaperf.rst | 167 ++++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
> 
> diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
> new file mode 100644
> index 000000000000..52999336a8ed
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/numaperf.rst
> @@ -0,0 +1,167 @@
> +.. _numaperf:
> +
> +=============
> +NUMA Locality
> +=============
> +
> +Some platforms may have multiple types of memory attached to a single
> +CPU. These disparate memory ranges share some characteristics, such as
> +CPU cache coherence, but may have different performance. For example,
> +different media types and buses affect bandwidth and latency.

This seems a bit restrictive, but I it gives a starting point.
I guess anyone who has a more complex system should look elsewhere for
how this maps to it!

> +
> +A system supporting such heterogeneous memory by grouping each memory
> +type under different "nodes" based on similar CPU locality and performance
> +characteristics.  Some memory may share the same node as a CPU, and others
> +are provided as memory only nodes. While memory only nodes do not provide
> +CPUs, they may still be directly accessible, or local, to one or more
> +compute nodes.

Perhaps define directly accessible?  I'm not keen on saying that they don't
involve an interconnect as that rules out things like CCIX with remote
memory homes.  The reality is this patch set works fine for that case.

The one or more compute nodes can only happen (I think) with a very weird
setup of an interconnect involved which is likely to have other data on it.

+ The following diagram shows one such example of two compute
> +nodes with local memory and a memory only node for each of compute node:
> +
> + +------------------+     +------------------+
> + | Compute Node 0   +-----+ Compute Node 1   |
> + | Local Node0 Mem  |     | Local Node1 Mem  |
> + +--------+---------+     +--------+---------+
> +          |                        |
> + +--------+---------+     +--------+---------+
> + | Slower Node2 Mem |     | Slower Node3 Mem |
> + +------------------+     +--------+---------+
> +
> +A "memory initiator" is a node containing one or more devices such as
> +CPUs or separate memory I/O devices that can initiate memory requests.
> +A "memory target" is a node containing one or more physical address
> +ranges accessible from one or more memory initiators.
> +
> +When multiple memory initiators exist, they may not all have the same
> +performance when accessing a given memory target. Each initiator-target
> +pair may be organized into different ranked access classes to represent
> +this relationship. The highest performing initiator to a given target
> +is considered to be one of that target's local initiators, and given
> +the highest access class, 0. Any given target may have one or more
> +local initiators, and any given initiator may have multiple local
> +memory targets.
> +
> +To aid applications matching memory targets with their initiators, the
> +kernel provides symlinks to each other. The following example lists the
> +relationship for the access class "0" memory initiators and targets, which is
> +the of nodes with the highest performing access relationship::
> +
> +	# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
> +	relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
> +
> +	# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
> +	relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
> +
> +================
> +NUMA Performance
> +================
> +
> +Applications may wish to consider which node they want their memory to
> +be allocated from based on the node's performance characteristics. If
> +the system provides these attributes, the kernel exports them under the
> +node sysfs hierarchy by appending the attributes directory under the
> +memory node's access class 0 initiators as follows::
> +
> +	/sys/devices/system/node/nodeY/access0/initiators/
> +
> +These attributes apply only when accessed from nodes that have the
> +are linked under the this access's inititiators.
> +
> +The performance characteristics the kernel provides for the local initiators
> +are exported are as follows::
> +
> +	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/
> +	/sys/devices/system/node/nodeY/access0/
> +	|-- read_bandwidth
> +	|-- read_latency
> +	|-- write_bandwidth
> +	`-- write_latency

These seem to be under
/sys/devices/system/node/nodeY/access0/initiators/
(so one directory deeper).

> +
> +The bandwidth attributes are provided in MiB/second.
> +
> +The latency attributes are provided in nanoseconds.
> +
> +The values reported here correspond to the rated latency and bandwidth
> +for the platform.
> +
> +==========
> +NUMA Cache
> +==========
> +
> +System memory may be constructed in a hierarchy of elements with various
> +performance characteristics in order to provide large address space of
> +slower performing memory side-cached by a smaller higher performing
> +memory. The system physical addresses that initiators are aware of
> +are provided by the last memory level in the hierarchy. The system
> +meanwhile uses higher performing memory to transparently cache access
> +to progressively slower levels.
> +
> +The term "far memory" is used to denote the last level memory in the
> +hierarchy. Each increasing cache level provides higher performing
> +initiator access, and the term "near memory" represents the fastest
> +cache provided by the system.
> +
> +This numbering is different than CPU caches where the cache level (ex:
> +L1, L2, L3) uses a CPU centric view with each increased level is lower
> +performing. In contrast, the memory cache level is centric to the last
> +level memory, so the higher numbered cache level denotes memory nearer
> +to the CPU, and further from far memory.
> +
> +The memory side caches are not directly addressable by software. When
> +software accesses a system address, the system will return it from the
> +near memory cache if it is present. If it is not present, the system
> +accesses the next level of memory until there is either a hit in that
> +cache level, or it reaches far memory.
> +
> +An application does not need to know about caching attributes in order
> +to use the system. Software may optionally query the memory cache
> +attributes in order to maximize the performance out of such a setup.
> +If the system provides a way for the kernel to discover this information,
> +for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
> +the kernel will append these attributes to the NUMA node memory target.
> +
> +When the kernel first registers a memory cache with a node, the kernel
> +will create the following directory::
> +
> +	/sys/devices/system/node/nodeX/side_cache/
> +
> +If that directory is not present, the system either does not not provide
> +a memory side cache, or that information is not accessible to the kernel.
> +
> +The attributes for each level of cache is provided under its cache
> +level index::
> +
> +	/sys/devices/system/node/nodeX/side_cache/indexA/
> +	/sys/devices/system/node/nodeX/side_cache/indexB/
> +	/sys/devices/system/node/nodeX/side_cache/indexC/
> +
> +Each cache level's directory provides its attributes. For example, the
> +following shows a single cache level and the attributes available for
> +software to query::
> +
> +	# tree sys/devices/system/node/node0/side_cache/
> +	/sys/devices/system/node/node0/side_cache/
> +	|-- index1
> +	|   |-- associativity
> +	|   |-- level

What is the purpose of having level in here?  Isn't it the same as the A..C
in the index naming?

> +	|   |-- line_size
> +	|   |-- size
> +	|   `-- write_policy
> +
> +The "associativity" will be 0 if it is a direct-mapped cache, and non-zero
> +for any other indexed based, multi-way associativity.

Is it worth providing the ACPI mapping in this doc?  We have None, Direct and
'complex'.   Fun question of what None means?  Not specified?

> +
> +The "level" is the distance from the far memory, and matches the number
> +appended to its "index" directory.
> +
> +The "line_size" is the number of bytes accessed on a cache miss.

Maybe "number of bytes accessed from next cache level" ?

> +
> +The "size" is the number of bytes provided by this cache level.
> +
> +The "write_policy" will be 0 for write-back, and non-zero for
> +write-through caching.
> +
> +========
> +See Also
> +========
> +.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> +       Section 5.2.27
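
For what it's worth, the query the documentation describes could look
roughly like this from userspace. It is a minimal sketch, assuming each
attribute is a single unsigned decimal followed by a newline, which is what
the CACHE_ATTR() show routine in patch 08 prints. The function name is
hypothetical.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal attribute, for example
 * "/sys/devices/system/node/node0/side_cache/index1/size".
 * Returns 0 on success, -1 if the file is missing or cannot be parsed.
 */
int read_side_cache_attr(const char *path, unsigned long long *val)
{
	FILE *f = fopen(path, "r");
	int parsed;

	if (!f)
		return -1;	/* no such attribute, or not readable */

	parsed = fscanf(f, "%llu", val) == 1;
	fclose(f);
	return parsed ? 0 : -1;
}
```

A caller would treat -1 on the side_cache paths the same way the text above
describes a missing directory: either there is no memory side cache, or the
kernel could not discover it.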



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 09/10] acpi/hmat: Register memory side cache attributes
  2019-01-24 23:07 ` [PATCHv5 09/10] acpi/hmat: Register memory side cache attributes Keith Busch
@ 2019-02-06 12:17     ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 12:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, 24 Jan 2019 16:07:23 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Register memory side cache attributes with the memory's node if HMAT
> provides the side cache information table.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
Trivial suggestion inline.
> ---
>  drivers/acpi/hmat/hmat.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> index 917e6122b3f0..11f65b38e9f9 100644
> --- a/drivers/acpi/hmat/hmat.c
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -245,6 +245,7 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
>  				   const unsigned long end)
>  {
>  	struct acpi_hmat_cache *cache = (void *)header;
> +	struct node_cache_attrs cache_attrs;
>  	u32 attrs;
>  
>  	if (cache->header.length < sizeof(*cache)) {
> @@ -258,6 +259,37 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
>  		cache->memory_PD, cache->cache_size, attrs,
>  		cache->number_of_SMBIOShandles);
>  
> +	cache_attrs.size = cache->cache_size;
> +	cache_attrs.level = (attrs & ACPI_HMAT_CACHE_LEVEL) >> 4;
> +	cache_attrs.line_size = (attrs & ACPI_HMAT_CACHE_LINE_SIZE) >> 16;
> +
> +	switch ((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) {

FIELD_GET might be nice for these to avoid having the shifts and the mask.
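
Something along these lines, as a standalone userspace sketch rather than a
tested kernel change. In the kernel the macro comes from <linux/bitfield.h>;
field_get() below is a simplified stand-in for it, and the mask values are
my assumption inferred from the shifts used in the patch (4, 8, 12 and 16),
so they should be checked against the ACPICA header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mask values, inferred from the shifts in the patch. */
#define HMAT_CACHE_LEVEL		0x000000F0u
#define HMAT_CACHE_ASSOCIATIVITY	0x00000F00u
#define HMAT_WRITE_POLICY		0x0000F000u
#define HMAT_CACHE_LINE_SIZE		0xFFFF0000u

/*
 * Simplified stand-in for FIELD_GET(): mask the value, then divide by the
 * lowest set bit of the mask, which is the same as shifting right by the
 * field offset. The caller never has to spell out the shift.
 */
static inline uint32_t field_get(uint32_t mask, uint32_t reg)
{
	assert(mask != 0);
	return (reg & mask) / (mask & -mask);
}
```

In the patch itself that would read as
FIELD_GET(ACPI_HMAT_CACHE_LEVEL, attrs) and so on, with the switch
statements taking the FIELD_GET() result directly.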

> +	case ACPI_HMAT_CA_DIRECT_MAPPED:
> +		cache_attrs.associativity = NODE_CACHE_DIRECT_MAP;
> +		break;
> +	case ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING:
> +		cache_attrs.associativity = NODE_CACHE_INDEXED;
> +		break;
> +	case ACPI_HMAT_CA_NONE:
> +	default:
> +		cache_attrs.associativity = NODE_CACHE_OTHER;
> +		break;
> +	}
> +
> +	switch ((attrs & ACPI_HMAT_WRITE_POLICY) >> 12) {
> +	case ACPI_HMAT_CP_WB:
> +		cache_attrs.write_policy = NODE_CACHE_WRITE_BACK;
> +		break;
> +	case ACPI_HMAT_CP_WT:
> +		cache_attrs.write_policy = NODE_CACHE_WRITE_THROUGH;
> +		break;
> +	case ACPI_HMAT_CP_NONE:
> +	default:
> +		cache_attrs.write_policy = NODE_CACHE_WRITE_OTHER;
> +		break;
> +	}
> +
> +	node_add_cache(pxm_to_node(cache->memory_PD), &cache_attrs);
>  	return 0;
>  }
>  

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 08/10] node: Add memory caching attributes
  2019-01-24 23:07 ` [PATCHv5 08/10] node: Add memory caching attributes Keith Busch
@ 2019-02-06 12:24     ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 12:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, 24 Jan 2019 16:07:22 -0700
Keith Busch <keith.busch@intel.com> wrote:

> System memory may have side caches to help improve access speed to
> frequently requested address ranges. While the system provided cache is
> transparent to the software accessing these memory ranges, applications
> can optimize their own access based on cache attributes.
> 
> Provide a new API for the kernel to register these memory side caches
> under the memory node that provides it.
> 
> The new sysfs representation is modeled from the existing cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/<cpu>/side_cache/.
<cpu>/cache/?

> Unlike CPU cacheinfo though, the node cache level is reported from
> the view of the memory. A higher number is nearer to the CPU, while
> lower levels are closer to the backing memory. Also unlike CPU cache,
> it is assumed the system will handle flushing any dirty cached memory
> to the last level on a power failure if the range is persistent memory.
That's a design choice.  A sensible one perhaps, but not a requirement of
this infrastructure.  Not that it really matters as who reads patch
descriptions after they are applied? :)

> 
> The attributes we export are the cache size, the line size, associativity,
> and write back policy.
> 
> Add the attributes for the system memory side caches to sysfs stable
> documentation.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
A few minor points inline.

> ---
>  Documentation/ABI/stable/sysfs-devices-node |  34 +++++++
>  drivers/base/node.c                         | 153 ++++++++++++++++++++++++++++
>  include/linux/node.h                        |  34 +++++++
>  3 files changed, 221 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 41cb9345e1e0..26327279b6b6 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -142,3 +142,37 @@ Contact:	Keith Busch <keith.busch@intel.com>
>  Description:
>  		This node's write latency in nanoseconds when access
>  		from nodes found in this class's linked initiators.
> +
> +What:		/sys/devices/system/node/nodeX/side_cache/indexY/associativity
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The cache's associativity: 0 for direct mapped, non-zero if
> +		indexed.
> +
> +What:		/sys/devices/system/node/nodeX/side_cache/indexY/level
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		This cache's level in the memory hierarchy. Matches 'Y' in the
> +		directory name.

Mentioned in the docs, but I'm not sure why we need this given it matches the
directory name in which it is found...  For the cpu caches they aren't the
same (data vs instruction) for example.

> +
> +What:		/sys/devices/system/node/nodeX/side_cache/indexY/line_size
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The number of bytes accessed from the next cache level on a
> +		cache miss.
> +
> +What:		/sys/devices/system/node/nodeX/side_cache/indexY/size
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The size of this memory side cache in bytes.
> +
> +What:		/sys/devices/system/node/nodeX/side_cache/indexY/write_policy
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The cache write policy: 0 for write-back, 1 for write-through,
> +		2 for other or unknown.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 2de546a040a5..9b4cb29863ff 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -205,6 +205,157 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
>  		}
>  	}
>  }
> +
> +/**
> + * struct node_cache_info - Internal tracking for memory node caches
> + * @dev:	Device representing the cache level
> + * @node:	List element for tracking in the node
> + * @cache_attrs:Attributes for this cache level
> + */
> +struct node_cache_info {
> +	struct device dev;
> +	struct list_head node;
> +	struct node_cache_attrs cache_attrs;
> +};
> +#define to_cache_info(device) container_of(device, struct node_cache_info, dev)
> +
> +#define CACHE_ATTR(name, fmt) 						\
> +static ssize_t name##_show(struct device *dev,				\
> +			   struct device_attribute *attr,		\
> +			   char *buf)					\
> +{									\
> +	return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\
> +}									\
> +DEVICE_ATTR_RO(name);
> +
> +CACHE_ATTR(size, "%llu")
> +CACHE_ATTR(level, "%u")
> +CACHE_ATTR(line_size, "%u")
> +CACHE_ATTR(associativity, "%u")
> +CACHE_ATTR(write_policy, "%u")
> +
> +static struct attribute *cache_attrs[] = {
> +	&dev_attr_level.attr,
> +	&dev_attr_associativity.attr,
> +	&dev_attr_size.attr,
> +	&dev_attr_line_size.attr,
> +	&dev_attr_write_policy.attr,
> +	NULL,
> +};
> +ATTRIBUTE_GROUPS(cache);
> +
> +static void node_cache_release(struct device *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static void node_cacheinfo_release(struct device *dev)
> +{
> +	struct node_cache_info *info = to_cache_info(dev);
> +	kfree(info);
> +}
> +
> +static void node_init_cache_dev(struct node *node)
> +{
> +	struct device *dev;
> +
> +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +	if (!dev)
> +		return;
> +
> +	dev->parent = &node->dev;
> +	dev->release = node_cache_release;
> +	if (dev_set_name(dev, "side_cache"))
> +		goto free_dev;
> +
> +	if (device_register(dev))
> +		goto free_name;
> +
> +	pm_runtime_no_callbacks(dev);
> +	node->cache_dev = dev;
> +	return;
> +free_name:
> +	kfree_const(dev->kobj.name);
> +free_dev:
> +	kfree(dev);
> +}
> +
> +/**
> + * node_add_cache - add cache attribute to a memory node
This is almost but not quite in kernel-doc.
node_add_cache() - add 

IIRC.
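
For reference, the function form kernel-doc describes puts parentheses
after the name. A sketch against the prototype from this patch; the forward
declaration is only there so the fragment compiles on its own, and the
summary wording is just a suggestion.

```c
struct node_cache_attrs;	/* defined in include/linux/node.h by this patch */

/**
 * node_add_cache() - Add cache attributes to a memory node
 * @nid: Node identifier that has new cache attributes
 * @cache_attrs: Attributes for the cache being added
 */
void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
```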

> + * @nid: Node identifier that has new cache attributes
> + * @cache_attrs: Attributes for the cache being added
> + */
> +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
> +{
> +	struct node_cache_info *info;
> +	struct device *dev;
> +	struct node *node;
> +
> +	if (!node_online(nid) || !node_devices[nid])
> +		return;
> +
> +	node = node_devices[nid];
> +	list_for_each_entry(info, &node->cache_attrs, node) {
> +		if (info->cache_attrs.level == cache_attrs->level) {
> +			dev_warn(&node->dev,
> +				"attempt to add duplicate cache level:%d\n",
> +				cache_attrs->level);
> +			return;
> +		}
> +	}
> +
> +	if (!node->cache_dev)
> +		node_init_cache_dev(node);
> +	if (!node->cache_dev)
> +		return;
> +
> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
> +	if (!info)
> +		return;
> +
> +	dev = &info->dev;
> +	dev->parent = node->cache_dev;
> +	dev->release = node_cacheinfo_release;
> +	dev->groups = cache_groups;
> +	if (dev_set_name(dev, "index%d", cache_attrs->level))
> +		goto free_cache;
> +
> +	info->cache_attrs = *cache_attrs;
> +	if (device_register(dev)) {
> +		dev_warn(&node->dev, "failed to add cache level:%d\n",
> +			 cache_attrs->level);
> +		goto free_name;
> +	}
> +	pm_runtime_no_callbacks(dev);
> +	list_add_tail(&info->node, &node->cache_attrs);
> +	return;
> +free_name:
> +	kfree_const(dev->kobj.name);
> +free_cache:
> +	kfree(info);
> +}
> +
> +static void node_remove_caches(struct node *node)
> +{
> +	struct node_cache_info *info, *next;
> +
> +	if (!node->cache_dev)
> +		return;
> +
> +	list_for_each_entry_safe(info, next, &node->cache_attrs, node) {
> +		list_del(&info->node);
> +		device_unregister(&info->dev);
> +	}
> +	device_unregister(node->cache_dev);
> +}
> +
> +static void node_init_caches(unsigned int nid)
> +{
> +	INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
> +}
> +#else
> +static void node_init_caches(unsigned int nid) { }
> +static void node_remove_caches(struct node *node) { }
>  #endif
>  
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
> @@ -489,6 +640,7 @@ void unregister_node(struct node *node)
>  {
>  	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
>  	node_remove_accesses(node);
> +	node_remove_caches(node);
>  	device_unregister(&node->dev);
>  }
>  
> @@ -781,6 +933,7 @@ int __register_one_node(int nid)
>  	INIT_LIST_HEAD(&node_devices[nid]->access_list);
>  	/* initialize work queue for memory hot plug */
>  	init_node_hugetlb_work(nid);
> +	node_init_caches(nid);
>  
>  	return error;
>  }
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 2db077363d9c..842e4ab2ae6d 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -37,6 +37,36 @@ struct node_hmem_attrs {
>  };
>  void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
>  			 unsigned access);
> +
> +enum cache_associativity {
> +	NODE_CACHE_DIRECT_MAP,
> +	NODE_CACHE_INDEXED,
> +	NODE_CACHE_OTHER,
> +};
> +
> +enum cache_write_policy {
> +	NODE_CACHE_WRITE_BACK,
> +	NODE_CACHE_WRITE_THROUGH,
> +	NODE_CACHE_WRITE_OTHER,
> +};
> +
> +/**
> + * struct node_cache_attrs - system memory caching attributes
> + *
> + * @associativity:	The ways memory blocks may be placed in cache
> + * @write_policy:	Write back or write through policy
> + * @size:		Total size of cache in bytes
> + * @line_size:		Number of bytes fetched on a cache miss
> + * @level:		Represents the cache hierarchy level
> + */
> +struct node_cache_attrs {
> +	enum cache_associativity associativity;
> +	enum cache_write_policy write_policy;
> +	u64 size;
> +	u16 line_size;
> +	u8  level;
> +};
> +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
>  #endif
>  
>  struct node {
> @@ -45,6 +75,10 @@ struct node {
>  #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
>  	struct work_struct	node_work;
>  #endif
> +#ifdef CONFIG_HMEM_REPORTING
> +	struct list_head cache_attrs;
> +	struct device *cache_dev;
> +#endif
>  };
>  
>  struct memory_block;

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 07/10] acpi/hmat: Register performance attributes
  2019-01-24 23:07 ` [PATCHv5 07/10] acpi/hmat: Register performance attributes Keith Busch
@ 2019-02-06 12:24     ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 12:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, 24 Jan 2019 16:07:21 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Register the local attached performace access attributes with the memory's
performance

> node if HMAT provides the locality table. While HMAT does make it possible
> to know performance for all possible initiator-target pairings, we export
> only the local and matching pairings at this time.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/acpi/hmat/Kconfig |  1 +
>  drivers/acpi/hmat/hmat.c  | 14 ++++++++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
> index c9637e2e7514..08e972ead159 100644
> --- a/drivers/acpi/hmat/Kconfig
> +++ b/drivers/acpi/hmat/Kconfig
> @@ -2,6 +2,7 @@
>  config ACPI_HMAT
>  	bool "ACPI Heterogeneous Memory Attribute Table Support"
>  	depends on ACPI_NUMA
> +	select HMEM_REPORTING
>  	help
>  	 If set, this option causes the kernel to set the memory NUMA node
>  	 relationships and access attributes in accordance with ACPI HMAT
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> index 85fd835c2e23..917e6122b3f0 100644
> --- a/drivers/acpi/hmat/hmat.c
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -430,6 +430,19 @@ static __init void hmat_register_target_initiators(struct memory_target *target)
>  		hmat_register_if_local(target, initiator);
>  }
>  
> +static __init void hmat_register_target_perf(struct memory_target *target)
> +{
> +	unsigned mem_nid = pxm_to_node(target->memory_pxm);
> +	struct node_hmem_attrs hmem_attrs = {
> +		.read_bandwidth	= target->read_bandwidth,
> +		.write_bandwidth= target->write_bandwidth,
> +		.read_latency	= target->read_latency,
> +		.write_latency	= target->write_latency,
> +	};
> +
> +	node_set_perf_attrs(mem_nid, &hmem_attrs, 0);
> +}
> +
>  static __init void hmat_register_targets(void)
>  {
>  	struct memory_target *target, *tnext;
> @@ -439,6 +452,7 @@ static __init void hmat_register_targets(void)
>  	list_for_each_entry_safe(target, tnext, &targets, node) {
>  		list_del(&target->node);
>  		hmat_register_target_initiators(target);
> +		hmat_register_target_perf(target);
>  		kfree(target);
>  	}
>  

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 05/10] acpi/hmat: Register processor domain to its memory
  2019-01-24 23:07 ` [PATCHv5 05/10] acpi/hmat: Register processor domain to its memory Keith Busch
@ 2019-02-06 12:26     ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 12:26 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, 24 Jan 2019 16:07:19 -0700
Keith Busch <keith.busch@intel.com> wrote:

> If the HMAT Subsystem Address Range provides a valid processor proximity
> domain for a memory domain, or a processor domain matches the performance
> access of the valid processor proximity domain, register the memory
> target with that initiator so this relationship will be visible under
> the node's sysfs directory.
> 
> Since HMAT requires valid address ranges have an equivalent SRAT entry,
> verify each memory target satisfies this requirement.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
A few comments inline.

Thanks,

Jonathan

> ---
>  drivers/acpi/hmat/hmat.c | 310 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 310 insertions(+)
> 
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> index 1741bf30d87f..85fd835c2e23 100644
> --- a/drivers/acpi/hmat/hmat.c
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -16,6 +16,91 @@
>  #include <linux/node.h>
>  #include <linux/sysfs.h>
>  
> +static __initdata LIST_HEAD(targets);
> +static __initdata LIST_HEAD(initiators);
> +static __initdata LIST_HEAD(localities);
> +
> +struct memory_target {
> +	struct list_head node;
> +	unsigned int memory_pxm;
> +	unsigned int processor_pxm;
> +	unsigned int read_bandwidth;
> +	unsigned int write_bandwidth;
> +	unsigned int read_latency;
> +	unsigned int write_latency;
> +};
> +
> +struct memory_initiator {
> +	struct list_head node;
> +	unsigned int processor_pxm;
> +};
> +
> +struct memory_locality {
> +	struct list_head node;
> +	struct acpi_hmat_locality *hmat_loc;
> +};
> +
> +static __init struct memory_initiator *find_mem_initiator(unsigned int cpu_pxm)
> +{
> +	struct memory_initiator *intitator;
> +
> +	list_for_each_entry(intitator, &initiators, node)
> +		if (intitator->processor_pxm == cpu_pxm)
> +			return intitator;
> +	return NULL;
> +}
> +
> +static __init struct memory_target *find_mem_target(unsigned int mem_pxm)
> +{
> +	struct memory_target *target;
> +
> +	list_for_each_entry(target, &targets, node)
> +		if (target->memory_pxm == mem_pxm)
> +			return target;
> +	return NULL;
> +}
> +
> +static __init struct memory_initiator *alloc_memory_initiator(
> +							unsigned int cpu_pxm)
> +{
> +	struct memory_initiator *intitator;
> +
> +	if (pxm_to_node(cpu_pxm) == NUMA_NO_NODE)
> +		return NULL;
> +
> +	intitator = find_mem_initiator(cpu_pxm);
> +	if (intitator)
> +		return intitator;
> +
> +	intitator = kzalloc(sizeof(*intitator), GFP_KERNEL);
> +	if (!intitator)
> +		return NULL;
> +
> +	intitator->processor_pxm = cpu_pxm;
> +	list_add_tail(&intitator->node, &initiators);
> +	return intitator;
> +}
> +
> +static __init void alloc_memory_target(unsigned int mem_pxm)
> +{
> +	struct memory_target *target;
> +
> +	if (pxm_to_node(mem_pxm) == NUMA_NO_NODE)
> +		return;
> +
> +	target = find_mem_target(mem_pxm);
> +	if (target)
> +		return;
> +
> +	target = kzalloc(sizeof(*target), GFP_KERNEL);
> +	if (!target)
> +		return;
> +
> +	target->memory_pxm = mem_pxm;
> +	target->processor_pxm = PXM_INVAL;
> +	list_add_tail(&target->node, &targets);
> +}
> +
>  static __init const char *hmat_data_type(u8 type)
>  {
>  	switch (type) {
> @@ -52,13 +137,45 @@ static __init const char *hmat_data_type_suffix(u8 type)
>  	};
>  }
>  
> +static __init void hmat_update_target_access(struct memory_target *target,
> +                                             u8 type, u32 value)
> +{
> +	switch (type) {
> +	case ACPI_HMAT_ACCESS_LATENCY:
> +		target->read_latency = value;
> +		target->write_latency = value;
> +		break;
> +	case ACPI_HMAT_READ_LATENCY:
> +		target->read_latency = value;
> +		break;
> +	case ACPI_HMAT_WRITE_LATENCY:
> +		target->write_latency = value;
> +		break;
> +	case ACPI_HMAT_ACCESS_BANDWIDTH:
> +		target->read_bandwidth = value;
> +		target->write_bandwidth = value;
> +		break;
> +	case ACPI_HMAT_READ_BANDWIDTH:
> +		target->read_bandwidth = value;
> +		break;
> +	case ACPI_HMAT_WRITE_BANDWIDTH:
> +		target->write_bandwidth = value;
> +		break;
> +	default:
> +		break;
> +	};
> +}
> +
>  static __init int hmat_parse_locality(union acpi_subtable_headers *header,
>  				      const unsigned long end)
>  {
>  	struct acpi_hmat_locality *hmat_loc = (void *)header;
> +	struct memory_target *target;
> +	struct memory_initiator *initiator;
>  	unsigned int init, targ, total_size, ipds, tpds;
>  	u32 *inits, *targs, value;
>  	u16 *entries;
> +	bool report = false;
>  	u8 type;
>  
>  	if (hmat_loc->header.length < sizeof(*hmat_loc)) {
> @@ -82,16 +199,42 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
>  		hmat_loc->flags, hmat_data_type(type), ipds, tpds,
>  		hmat_loc->entry_base_unit);
>  
> +	/* Don't report performance of memory side caches */
> +	switch (hmat_loc->flags & ACPI_HMAT_MEMORY_HIERARCHY) {
> +	case ACPI_HMAT_MEMORY:
> +	case ACPI_HMAT_LAST_LEVEL_CACHE:

Both can be true under ACPI 6.2. Do we actually want to report both if
both are present?

> +		report = true;
> +		break;
> +	default:
> +		break;
> +	}
> +
>  	inits = (u32 *)(hmat_loc + 1);
>  	targs = &inits[ipds];
>  	entries = (u16 *)(&targs[tpds]);
>  	for (init = 0; init < ipds; init++) {
> +		initiator = alloc_memory_initiator(inits[init]);
Error handling? alloc_memory_initiator() can return NULL, and initiator
is dereferenced in the inner loop below.
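
A minimal sketch of one possible guard, written as standalone C rather
than kernel code. The two structures are hypothetical reductions of the
patch's memory_target and memory_initiator, and
target_matches_initiator() is a made-up helper name; the point is only
that a NULL initiator is treated as "no match" instead of being
dereferenced.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the patch's structures, reduced to the
 * one field the comparison needs. */
struct memory_initiator { unsigned int processor_pxm; };
struct memory_target { unsigned int processor_pxm; };

/* True only when both pointers are valid and the target's processor
 * domain equals the initiator's. A NULL initiator (invalid PXM or a
 * failed allocation) never matches. */
static bool target_matches_initiator(const struct memory_target *target,
				     const struct memory_initiator *initiator)
{
	if (!target || !initiator)
		return false;
	return target->processor_pxm == initiator->processor_pxm;
}
```

Whether to skip the initiator quietly or fail the whole parse with an
error message is a separate design choice that the sketch does not
settle.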

>  		for (targ = 0; targ < tpds; targ++) {
>  			value = entries[init * tpds + targ];
>  			value = (value * hmat_loc->entry_base_unit) / 10;
>  			pr_info("  Initiator-Target[%d-%d]:%d%s\n",
>  				inits[init], targs[targ], value,
>  				hmat_data_type_suffix(type));
> +
> +			target = find_mem_target(targs[targ]);
> +			if (target && report &&
> +			    target->processor_pxm == initiator->processor_pxm)
> +				hmat_update_target_access(target, type, value);
> +		}
> +	}
> +
> +	if (report) {
> +		struct memory_locality *loc;
> +
> +		loc = kzalloc(sizeof(*loc), GFP_KERNEL);
> +		if (loc) {
> +			loc->hmat_loc = hmat_loc;
> +			list_add_tail(&loc->node, &localities);
>  		}

Error handling for that memory allocation failing? It's unlikely to
happen, but it would be nice to handle it anyway.

>  	}
>  
> @@ -122,16 +265,35 @@ static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
>  					   const unsigned long end)
>  {
>  	struct acpi_hmat_address_range *spa = (void *)header;
> +	struct memory_target *target = NULL;
>  
>  	if (spa->header.length != sizeof(*spa)) {
>  		pr_debug("HMAT: Unexpected address range header length: %d\n",
>  			 spa->header.length);
>  		return -EINVAL;
>  	}
> +

Might as well move that blank-line change to the patch that added this code.

>  	pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
>  		spa->physical_address_base, spa->physical_address_length,
>  		spa->flags, spa->processor_PD, spa->memory_PD);
>  
> +	if (spa->flags & ACPI_HMAT_MEMORY_PD_VALID) {
> +		target = find_mem_target(spa->memory_PD);
> +		if (!target) {
> +			pr_debug("HMAT: Memory Domain missing from SRAT\n");
> +			return -EINVAL;
> +		}
> +	}
> +	if (target && spa->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
> +		int p_node = pxm_to_node(spa->processor_PD);
> +
> +		if (p_node == NUMA_NO_NODE) {
> +			pr_debug("HMAT: Invalid Processor Domain\n");
> +			return -EINVAL;
> +		}
> +		target->processor_pxm = p_node;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -155,6 +317,142 @@ static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
>  	}
>  }
>  
> +static __init int srat_parse_mem_affinity(union acpi_subtable_headers *header,
> +					  const unsigned long end)
> +{
> +	struct acpi_srat_mem_affinity *ma = (void *)header;
> +
> +	if (!ma)
> +		return -EINVAL;
> +	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
> +		return 0;
> +	alloc_memory_target(ma->proximity_domain);
> +	return 0;
> +}
> +
> +static __init bool hmat_is_local(struct memory_target *target,
> +                                 u8 type, u32 value)
> +{
> +	switch (type) {
> +	case ACPI_HMAT_ACCESS_LATENCY:
> +		return value == target->read_latency &&
> +		       value == target->write_latency;
> +	case ACPI_HMAT_READ_LATENCY:
> +		return value == target->read_latency;
> +	case ACPI_HMAT_WRITE_LATENCY:
> +		return value == target->write_latency;
> +	case ACPI_HMAT_ACCESS_BANDWIDTH:
> +		return value == target->read_bandwidth &&
> +		       value == target->write_bandwidth;
> +	case ACPI_HMAT_READ_BANDWIDTH:
> +		return value == target->read_bandwidth;
> +	case ACPI_HMAT_WRITE_BANDWIDTH:
> +		return value == target->write_bandwidth;
> +	default:
> +		return true;
> +	};
> +}
> +
> +static bool hmat_is_local_initiator(struct memory_target *target,
> +				    struct memory_initiator *initiator,
> +				    struct acpi_hmat_locality *hmat_loc)
> +{
> +	unsigned int ipds, tpds, i, idx = 0, tdx = 0;
> +	u32 *inits, *targs, value;
> +	u16 *entries;
> +
> +	ipds = hmat_loc->number_of_initiator_Pds;
> +	tpds = hmat_loc->number_of_target_Pds;
> +	inits = (u32 *)(hmat_loc + 1);
> +	targs = &inits[ipds];
> +	entries = (u16 *)(&targs[tpds]);
As earlier, I'd prefer not having indexes off the end of arrays.
Clearer to my eye to just have explicit pointer maths.
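
For illustration, a standalone C sketch of the explicit pointer
arithmetic. struct locality_arrays and locate_arrays() are hypothetical
names, not part of the patch; the layout assumed is the one the quoted
code uses (initiator PXMs, then target PXMs, then the entry matrix). The
resulting addresses are the same as with &inits[ipds], only the spelling
differs.

```c
#include <stdint.h>

/* Hypothetical view of the three arrays that follow a locality header:
 * ipds initiator PXMs, then tpds target PXMs, then ipds * tpds entries. */
struct locality_arrays {
	const uint32_t *inits;
	const uint32_t *targs;
	const uint16_t *entries;
};

/* Locate each array by advancing a pointer past the previous one, so
 * no expression is written as an index one past an array's end. */
static struct locality_arrays locate_arrays(const void *after_header,
					    unsigned int ipds,
					    unsigned int tpds)
{
	struct locality_arrays a;

	a.inits = (const uint32_t *)after_header;
	a.targs = a.inits + ipds;
	a.entries = (const uint16_t *)(a.targs + tpds);
	return a;
}
```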

> +
> +	for (i = 0; i < ipds; i++) {
> +		if (inits[i] == initiator->processor_pxm) {
> +			idx = i;
> +			break;
> +		}
> +	}
> +
> +	if (i == ipds)
> +		return false;
> +
> +	for (i = 0; i < tpds; i++) {
> +		if (targs[i] == target->memory_pxm) {
> +			tdx = i;
> +			break;
> +		}
> +	}
> +	if (i == tpds)
> +		return false;
> +
> +	value = entries[idx * tpds + tdx];
> +	value = (value * hmat_loc->entry_base_unit) / 10;
Just noticed, this might well overflow: entry_base_unit is 8 bytes long,
while value is only a u32.
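
A rough sketch of doing the scaling in 64 bits and rejecting results
that do not fit, in standalone C. scale_entry() is a hypothetical helper,
not something from the patch, and returning false on overflow is just
one possible policy (clamping would be another).

```c
#include <stdbool.h>
#include <stdint.h>

/* Scale a raw 16-bit table entry by a 64-bit base unit and divide by
 * 10, reporting failure instead of silently truncating when the result
 * does not fit in 32 bits. */
static bool scale_entry(uint16_t entry, uint64_t base_unit, uint32_t *out)
{
	uint64_t scaled;

	/* The product fits in 64 bits only if base_unit does not
	 * exceed UINT64_MAX / entry. */
	if (entry != 0 && base_unit > UINT64_MAX / entry)
		return false;

	scaled = (uint64_t)entry * base_unit / 10;
	if (scaled > UINT32_MAX)
		return false;

	*out = (uint32_t)scaled;
	return true;
}
```

The same helper could then be shared between hmat_parse_locality() and
hmat_is_local_initiator(), which currently repeat the calculation.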

> +
> +	return hmat_is_local(target, hmat_loc->data_type, value);
> +}
> +
> +static __init void hmat_register_if_local(struct memory_target *target,
> +					  struct memory_initiator *initiator)
> +{
> +	unsigned int mem_nid, cpu_nid;
> +	struct memory_locality *loc;
> +
> +	if (initiator->processor_pxm == target->processor_pxm)
> +		return;
> +
> +	list_for_each_entry(loc, &localities, node)
> +		if (!hmat_is_local_initiator(target, initiator, loc->hmat_loc))
> +			return;
> +
> +	mem_nid = pxm_to_node(target->memory_pxm);
> +	cpu_nid = pxm_to_node(initiator->processor_pxm);
> +	register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
> +}
> +
> +static __init void hmat_register_target_initiators(struct memory_target *target)
> +{
> +	struct memory_initiator *initiator;
> +	unsigned int mem_nid, cpu_nid;
> +
> +	if (target->processor_pxm == PXM_INVAL)
> +		return;
> +
> +	mem_nid = pxm_to_node(target->memory_pxm);
> +	cpu_nid = pxm_to_node(target->processor_pxm);
> +	if (register_memory_node_under_compute_node(mem_nid, cpu_nid, 0))

As mentioned in the previous patch, I think this can register devices
that aren't freed in the error path...

In general, I think the error handling needs another look, in
particular to make sure we get helpful error messages for likely
table errors.

> +		return;
> +
> +	if (list_empty(&localities))
> +		return;
> +
> +	list_for_each_entry(initiator, &initiators, node)
> +		hmat_register_if_local(target, initiator);
> +}
> +
> +static __init void hmat_register_targets(void)
> +{
> +	struct memory_target *target, *tnext;
> +	struct memory_locality *loc, *lnext;
> +	struct memory_initiator *intitator, *inext;
> +
> +	list_for_each_entry_safe(target, tnext, &targets, node) {
> +		list_del(&target->node);
> +		hmat_register_target_initiators(target);
> +		kfree(target);
> +	}
> +
> +	list_for_each_entry_safe(intitator, inext, &initiators, node) {
> +		list_del(&intitator->node);
> +		kfree(intitator);
> +	}
> +
> +	list_for_each_entry_safe(loc, lnext, &localities, node) {
> +		list_del(&loc->node);
> +		kfree(loc);
> +	}
> +}
> +
>  static __init int hmat_init(void)
>  {
>  	struct acpi_table_header *tbl;
> @@ -164,6 +462,17 @@ static __init int hmat_init(void)
>  	if (srat_disabled())
>  		return 0;
>  
> +	status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
> +	if (ACPI_FAILURE(status))
> +		return 0;
> +
> +	if (acpi_table_parse_entries(ACPI_SIG_SRAT,
> +				sizeof(struct acpi_table_srat),
> +				ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> +				srat_parse_mem_affinity, 0) < 0)
> +		goto out_put;
> +	acpi_put_table(tbl);
> +
>  	status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
>  	if (ACPI_FAILURE(status))
>  		return 0;
> @@ -174,6 +483,7 @@ static __init int hmat_init(void)
>  					     hmat_parse_subtable, 0) < 0)
>  			goto out_put;
>  	}
> +	hmat_register_targets();
>  out_put:
>  	acpi_put_table(tbl);
>  	return 0;

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-01-24 23:07 ` [PATCHv5 04/10] node: Link memory nodes to their compute nodes Keith Busch
@ 2019-02-06 12:26     ` Jonathan Cameron
  2019-02-06 12:26     ` Jonathan Cameron
  2019-02-07 11:35     ` Rafael J. Wysocki
  2 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 12:26 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, 24 Jan 2019 16:07:18 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Systems may be constructed with various specialized nodes. Some nodes
> may provide memory, some provide compute devices that access and use
> that memory, and others may provide both. Nodes that provide memory are
> referred to as memory targets, and nodes that can initiate memory access
> are referred to as memory initiators.
> 
> Memory targets will often have varying access characteristics from
> different initiators, and platforms may have ways to express those
> relationships. In preparation for these systems, provide interfaces for
> the kernel to export the memory relationship among different nodes memory
> targets and their initiators with symlinks to each other.
> 
> If a system provides access locality for each initiator-target pair, nodes
> may be grouped into ranked access classes relative to other nodes. The
> new interface allows a subsystem to register relationships of varying
> classes if available and desired to be exported.
> 
> A memory initiator may have multiple memory targets in the same access
> class. The target memory's initiators in a given class indicate the
> nodes access characteristics share the same performance relative to other
> linked initiator nodes. Each target within an initiator's access class,
> though, do not necessarily perform the same as each other.
> 
> A memory target node may have multiple memory initiators. All linked
> initiators in a target's class have the same access characteristics to
> that target.
> 
> The following example show the nodes' new sysfs hierarchy for a memory
> target node 'Y' with access class 0 from initiator node 'X':
> 
>   # symlinks -v /sys/devices/system/node/nodeX/access0/
>   relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
> 
>   # symlinks -v /sys/devices/system/node/nodeY/access0/
>   relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
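
To make the layout in the example above concrete, here is a small
userspace C sketch that builds the directory path for a node's access
class. node_access_path() is a hypothetical helper for illustration
only; the path components are taken from the example in the commit
message.

```c
#include <stdio.h>

/* Format the directory that holds a node's symlinks for one access
 * class. kind is "initiators" or "targets". Returns the snprintf()
 * result, so a value >= len means the path was truncated. */
static int node_access_path(char *buf, size_t len, unsigned int nid,
			    unsigned int access, const char *kind)
{
	return snprintf(buf, len,
			"/sys/devices/system/node/node%u/access%u/%s",
			nid, access, kind);
}
```

A tool would then list that directory and resolve each nodeN symlink to
find the linked initiators or targets.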
> 
> The new attributes are added to the sysfs stable documentation.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>

A few comments inline.

> ---
>  Documentation/ABI/stable/sysfs-devices-node |  25 ++++-
>  drivers/base/node.c                         | 142 +++++++++++++++++++++++++++-
>  include/linux/node.h                        |   7 +-
>  3 files changed, 171 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 3e90e1f3bf0a..fb843222a281 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -90,4 +90,27 @@ Date:		December 2009
>  Contact:	Lee Schermerhorn <lee.schermerhorn@hp.com>
>  Description:
>  		The node's huge page size control/query attributes.
> -		See Documentation/admin-guide/mm/hugetlbpage.rst
> \ No newline at end of file
> +		See Documentation/admin-guide/mm/hugetlbpage.rst
> +
> +What:		/sys/devices/system/node/nodeX/accessY/
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The node's relationship to other nodes for access class "Y".
> +
> +What:		/sys/devices/system/node/nodeX/accessY/initiators/
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The directory containing symlinks to memory initiator
> +		nodes that have class "Y" access to this target node's
> +		memory. CPUs and other memory initiators in nodes not in
> +		the list accessing this node's memory may have different
> +		performance.

Also seems to contain the characteristics of those accesses.

> +
> +What:		/sys/devices/system/node/nodeX/classY/targets/
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The directory containing symlinks to memory targets that
> +		this initiator node has class "Y" access.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 86d6cd92ce3d..6f4097680580 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -17,6 +17,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/cpu.h>
>  #include <linux/device.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/swap.h>
>  #include <linux/slab.h>
>  
> @@ -59,6 +60,94 @@ static inline ssize_t node_read_cpulist(struct device *dev,
>  static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>  
> +/**
> + * struct node_access_nodes - Access class device to hold user visible
> + * 			      relationships to other nodes.
> + * @dev:	Device for this memory access class
> + * @list_node:	List element in the node's access list
> + * @access:	The access class rank
> + */
> +struct node_access_nodes {
> +	struct device		dev;
> +	struct list_head	list_node;
> +	unsigned		access;
> +};
> +#define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev)
> +
> +static struct attribute *node_init_access_node_attrs[] = {
> +	NULL,
> +};
> +
> +static struct attribute *node_targ_access_node_attrs[] = {
> +	NULL,
> +};
> +
> +static const struct attribute_group initiators = {
> +	.name	= "initiators",
> +	.attrs	= node_init_access_node_attrs,
> +};
> +
> +static const struct attribute_group targets = {
> +	.name	= "targets",
> +	.attrs	= node_targ_access_node_attrs,
> +};
> +
> +static const struct attribute_group *node_access_node_groups[] = {
> +	&initiators,
> +	&targets,
> +	NULL,
> +};
> +
> +static void node_remove_accesses(struct node *node)
> +{
> +	struct node_access_nodes *c, *cnext;
> +
> +	list_for_each_entry_safe(c, cnext, &node->access_list, list_node) {
> +		list_del(&c->list_node);
> +		device_unregister(&c->dev);
> +	}
> +}
> +
> +static void node_access_release(struct device *dev)
> +{
> +	kfree(to_access_nodes(dev));
> +}
> +
> +static struct node_access_nodes *node_init_node_access(struct node *node,
> +						       unsigned access)
> +{
> +	struct node_access_nodes *access_node;
> +	struct device *dev;
> +
> +	list_for_each_entry(access_node, &node->access_list, list_node)
> +		if (access_node->access == access)
> +			return access_node;
> +
> +	access_node = kzalloc(sizeof(*access_node), GFP_KERNEL);
> +	if (!access_node)
> +		return NULL;
> +
> +	access_node->access = access;
> +	dev = &access_node->dev;
> +	dev->parent = &node->dev;
> +	dev->release = node_access_release;
> +	dev->groups = node_access_node_groups;
> +	if (dev_set_name(dev, "access%u", access))
> +		goto free;
> +
> +	if (device_register(dev))
> +		goto free_name;
> +
> +	pm_runtime_no_callbacks(dev);
> +	list_add_tail(&access_node->list_node, &node->access_list);
> +	return access_node;
> +free_name:
> +	kfree_const(dev->kobj.name);
> +free:
> +	kfree(access_node);
> +	return NULL;
> +}
> +
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
>  static ssize_t node_read_meminfo(struct device *dev,
>  			struct device_attribute *attr, char *buf)
> @@ -340,7 +429,7 @@ static int register_node(struct node *node, int num)
>  void unregister_node(struct node *node)
>  {
>  	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
> -
> +	node_remove_accesses(node);
>  	device_unregister(&node->dev);
>  }
>  
> @@ -372,6 +461,56 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid)
>  				 kobject_name(&node_devices[nid]->dev.kobj));
>  }
>  
> +/**
> + * register_memory_node_under_compute_node - link memory node to its compute
> + *					     node for a given access class.
> + * @mem_node:	Memory node number
> + * @cpu_node:	Cpu  node number
> + * @access:	Access class to register
> + *
> + * Description:
> + * 	For use with platforms that may have separate memory and compute nodes.
I would drop that first line: this also applies on systems where memory and
compute nodes are not separate, and it will be used wherever we want HMAT
simply for the better stats.

> + * 	This function will export node relationships linking which memory
> + * 	initiator nodes can access memory targets at a given ranked access
> + * 	class.
> + */
> +int register_memory_node_under_compute_node(unsigned int mem_nid,
> +					    unsigned int cpu_nid,
> +					    unsigned access)
> +{
> +	struct node *init_node, *targ_node;
> +	struct node_access_nodes *initiator, *target;
> +	int ret;
> +
> +	if (!node_online(cpu_nid) || !node_online(mem_nid))
> +		return -ENODEV;

What do we do under memory/node hotplug?  More than likely that will
apply in such systems (it does in mine for starters).
Clearly, to do the full story we would need _HMA support etc., but
we can do the prebaked version by having HMAT entries for nodes
that aren't online yet (like we do for SRAT).

Perhaps one for a follow-up patch set.  However, I'd like a
pr_info to indicate that the node is listed but not currently online.

> +
> +	init_node = node_devices[cpu_nid];
> +	targ_node = node_devices[mem_nid];
> +	initiator = node_init_node_access(init_node, access);
> +	target = node_init_node_access(targ_node, access);
> +	if (!initiator || !target)
> +		return -ENOMEM;
If one of these fails and the other doesn't, and the one that succeeded
did an init, don't we end up leaking a device here?  I'd expect this
function to not leave things hanging if it has an error.  It should
unwind anything it has done.  The device has been added to the list so
it could be cleaned up later, but I'm not seeing that code.

These only get cleaned up when the node is removed.
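
A minimal userspace sketch of the unwinding asked for above. It is a model
only: struct access_class, struct model_node, class_get(), class_put() and
the allocs_before_failure knob are hypothetical stand-ins, not kernel APIs.
The idea is that the lookup-or-create helper reports whether it created the
object, so the caller can drop exactly what it added when the second lookup
fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct node_access_nodes. */
struct access_class {
	struct access_class *next;
	unsigned int access;
};

/* Hypothetical stand-in for struct node: only the per-node class list. */
struct model_node {
	struct access_class *classes;
};

/* Test knob: successful allocations allowed before one failure, -1 = off. */
static int allocs_before_failure = -1;

static struct access_class *class_alloc(void)
{
	if (allocs_before_failure == 0) {
		allocs_before_failure = -1;
		return NULL;
	}
	if (allocs_before_failure > 0)
		allocs_before_failure--;
	return calloc(1, sizeof(struct access_class));
}

/* Find the class or create it; *created tells the caller which happened. */
static struct access_class *class_get(struct model_node *node,
				      unsigned int access, bool *created)
{
	struct access_class *c;

	*created = false;
	for (c = node->classes; c; c = c->next)
		if (c->access == access)
			return c;

	c = class_alloc();
	if (!c)
		return NULL;
	c->access = access;
	c->next = node->classes;
	node->classes = c;
	*created = true;
	return c;
}

/* Unlink and free one class from its node's list. */
static void class_put(struct model_node *node, struct access_class *victim)
{
	struct access_class **p;

	assert(victim);
	for (p = &node->classes; *p; p = &(*p)->next) {
		if (*p == victim) {
			*p = victim->next;
			free(victim);
			return;
		}
	}
}

/* Set up both sides; on failure, remove only what this call created. */
static int link_nodes(struct model_node *init, struct model_node *targ,
		      unsigned int access)
{
	bool init_created, targ_created;
	struct access_class *i, *t;

	i = class_get(init, access, &init_created);
	if (!i)
		return -1;
	t = class_get(targ, access, &targ_created);
	if (!t) {
		if (init_created)
			class_put(init, i);
		return -1;
	}
	return 0;
}
```

A class that already existed before the call is left in place on failure,
since an earlier successful registration may still depend on it.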

> +
> +	ret = sysfs_add_link_to_group(&initiator->dev.kobj, "targets",
> +				      &targ_node->dev.kobj,
> +				      dev_name(&targ_node->dev));
> +	if (ret)
> +		return ret;
> +
> +	ret = sysfs_add_link_to_group(&target->dev.kobj, "initiators",
> +				      &init_node->dev.kobj,
> +				      dev_name(&init_node->dev));
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> + err:
> +	sysfs_remove_link_from_group(&initiator->dev.kobj, "targets",
> +				     dev_name(&targ_node->dev));
> +	return ret;
> +}
> +
>  int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
>  {
>  	struct device *obj;
> @@ -580,6 +719,7 @@ int __register_one_node(int nid)
>  			register_cpu_under_node(cpu, nid);
>  	}
>  
> +	INIT_LIST_HEAD(&node_devices[nid]->access_list);
>  	/* initialize work queue for memory hot plug */
>  	init_node_hugetlb_work(nid);
>  
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 257bb3d6d014..f34688a203c1 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -17,11 +17,12 @@
>  
>  #include <linux/device.h>
>  #include <linux/cpumask.h>
> +#include <linux/list.h>
>  #include <linux/workqueue.h>
>  
>  struct node {
>  	struct device	dev;
> -
> +	struct list_head access_list;
>  #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
>  	struct work_struct	node_work;
>  #endif
> @@ -75,6 +76,10 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>  extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>  					   unsigned long phys_index);
>  
> +extern int register_memory_node_under_compute_node(unsigned int mem_nid,
> +						   unsigned int cpu_nid,
> +						   unsigned access);
> +
>  #ifdef CONFIG_HUGETLBFS
>  extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
>  					 node_registration_func_t unregister);

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory
  2019-01-24 23:07 ` [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory Keith Busch
@ 2019-02-06 12:28     ` Jonathan Cameron
  2019-02-06 12:28     ` Jonathan Cameron
  1 sibling, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 12:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, 24 Jan 2019 16:07:17 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Systems may provide different memory types and export this information
> in the ACPI Heterogeneous Memory Attribute Table (HMAT). Parse these
> tables provided by the platform and report the memory access and caching
> attributes to the kernel messages.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
Minor comments inline.

One question for reviewers in general: should this be a lot 'louder' on
failures?

I'd really like the kernel to moan a lot on all occasions if we start getting
bad HMAT tables out there.  This feels to me too silent by default!

Jonathan
> ---
>  drivers/acpi/Kconfig       |   1 +
>  drivers/acpi/Makefile      |   1 +
>  drivers/acpi/hmat/Kconfig  |   8 ++
>  drivers/acpi/hmat/Makefile |   1 +
>  drivers/acpi/hmat/hmat.c   | 181 +++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 192 insertions(+)
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
> 
> diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> index 90ff0a47c12e..b377f970adfd 100644
> --- a/drivers/acpi/Kconfig
> +++ b/drivers/acpi/Kconfig
> @@ -465,6 +465,7 @@ config ACPI_REDUCED_HARDWARE_ONLY
>  	  If you are unsure what to do, do not enable this option.
>  
>  source "drivers/acpi/nfit/Kconfig"
> +source "drivers/acpi/hmat/Kconfig"
>  
>  source "drivers/acpi/apei/Kconfig"
>  source "drivers/acpi/dptf/Kconfig"
> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> index bb857421c2e8..5d361e4e3405 100644
> --- a/drivers/acpi/Makefile
> +++ b/drivers/acpi/Makefile
> @@ -80,6 +80,7 @@ obj-$(CONFIG_ACPI_PROCESSOR)	+= processor.o
>  obj-$(CONFIG_ACPI)		+= container.o
>  obj-$(CONFIG_ACPI_THERMAL)	+= thermal.o
>  obj-$(CONFIG_ACPI_NFIT)		+= nfit/
> +obj-$(CONFIG_ACPI_HMAT)		+= hmat/
>  obj-$(CONFIG_ACPI)		+= acpi_memhotplug.o
>  obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
>  obj-$(CONFIG_ACPI_BATTERY)	+= battery.o
> diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
> new file mode 100644
> index 000000000000..c9637e2e7514
> --- /dev/null
> +++ b/drivers/acpi/hmat/Kconfig
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: GPL-2.0
> +config ACPI_HMAT
> +	bool "ACPI Heterogeneous Memory Attribute Table Support"
> +	depends on ACPI_NUMA
> +	help
> +	 If set, this option causes the kernel to set the memory NUMA node
> +	 relationships and access attributes in accordance with ACPI HMAT
> +	 (Heterogeneous Memory Attributes Table).
> diff --git a/drivers/acpi/hmat/Makefile b/drivers/acpi/hmat/Makefile
> new file mode 100644
> index 000000000000..e909051d3d00
> --- /dev/null
> +++ b/drivers/acpi/hmat/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_ACPI_HMAT) := hmat.o
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> new file mode 100644
> index 000000000000..1741bf30d87f
> --- /dev/null
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -0,0 +1,181 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2019, Intel Corporation.
> + *
> + * Heterogeneous Memory Attributes Table (HMAT) representation
> + *
> + * This program parses and reports the platform's HMAT tables, and registers
> + * the applicable attributes with the node's interfaces.
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/bitops.h>
> +#include <linux/device.h>
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/node.h>
> +#include <linux/sysfs.h>
> +
> +static __init const char *hmat_data_type(u8 type)
> +{
> +	switch (type) {
> +	case ACPI_HMAT_ACCESS_LATENCY:
> +		return "Access Latency";
> +	case ACPI_HMAT_READ_LATENCY:
> +		return "Read Latency";
> +	case ACPI_HMAT_WRITE_LATENCY:
> +		return "Write Latency";
> +	case ACPI_HMAT_ACCESS_BANDWIDTH:
> +		return "Access Bandwidth";
> +	case ACPI_HMAT_READ_BANDWIDTH:
> +		return "Read Bandwidth";
> +	case ACPI_HMAT_WRITE_BANDWIDTH:
> +		return "Write Bandwidth";
> +	default:
> +		return "Reserved";
> +	};
> +}
> +
> +static __init const char *hmat_data_type_suffix(u8 type)
> +{
> +	switch (type) {
> +	case ACPI_HMAT_ACCESS_LATENCY:
> +	case ACPI_HMAT_READ_LATENCY:
> +	case ACPI_HMAT_WRITE_LATENCY:
> +		return " nsec";
> +	case ACPI_HMAT_ACCESS_BANDWIDTH:
> +	case ACPI_HMAT_READ_BANDWIDTH:
> +	case ACPI_HMAT_WRITE_BANDWIDTH:
> +		return " MB/s";
> +	default:
> +		return "";
> +	};
> +}
> +
> +static __init int hmat_parse_locality(union acpi_subtable_headers *header,
> +				      const unsigned long end)
> +{
> +	struct acpi_hmat_locality *hmat_loc = (void *)header;
> +	unsigned int init, targ, total_size, ipds, tpds;
> +	u32 *inits, *targs, value;
> +	u16 *entries;
> +	u8 type;
> +
> +	if (hmat_loc->header.length < sizeof(*hmat_loc)) {
> +		pr_debug("HMAT: Unexpected locality header length: %d\n",
> +			 hmat_loc->header.length);
> +		return -EINVAL;
> +	}
> +
> +	type = hmat_loc->data_type;
> +	ipds = hmat_loc->number_of_initiator_Pds;
> +	tpds = hmat_loc->number_of_target_Pds;
> +	total_size = sizeof(*hmat_loc) + sizeof(*entries) * ipds * tpds +
> +		     sizeof(*inits) * ipds + sizeof(*targs) * tpds;
> +	if (hmat_loc->header.length < total_size) {
> +		pr_debug("HMAT: Unexpected locality header length:%d, minimum required:%d\n",
> +			 hmat_loc->header.length, total_size);
> +		return -EINVAL;
> +	}
> +
> +	pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n",
> +		hmat_loc->flags, hmat_data_type(type), ipds, tpds,
> +		hmat_loc->entry_base_unit);
> +
> +	inits = (u32 *)(hmat_loc + 1);
> +	targs = &inits[ipds];
This line is a bit of an oddity as it's indexing off the end of the data.
	targs = inits + ipds;
would be nicer to my mind, as it doesn't even hint that we are still in inits.


> +	entries = (u16 *)(&targs[tpds]);

As above, I'd prefer we did the pointer arithmetic explicitly rather
than using an index off the end of the array.
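
A small userspace sketch of the explicit pointer arithmetic suggested in the
two comments above. struct loc_hdr is a hypothetical, simplified header (the
real struct acpi_hmat_locality has more fields), and loc_map() and
struct loc_view are made-up names. Each array is derived from the end of the
previous one, after the length has been checked against the counts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified header; not the real ACPI layout. */
struct loc_hdr {
	uint32_t length;	/* total bytes, header included */
	uint32_t ipds;		/* number of initiator domains */
	uint32_t tpds;		/* number of target domains */
};

struct loc_view {
	const uint32_t *inits;		/* ipds initiator domain ids */
	const uint32_t *targs;		/* tpds target domain ids */
	const uint16_t *entries;	/* ipds * tpds matrix, row per initiator */
};

/* Returns 0 and fills *v, or -1 if the table is too short for its counts. */
static int loc_map(const struct loc_hdr *hdr, struct loc_view *v)
{
	uint64_t need;

	assert(hdr && v);

	/*
	 * Each domain id costs 4 bytes, so larger counts cannot fit;
	 * rejecting them first keeps the 64-bit sum below from wrapping.
	 */
	if (hdr->ipds > hdr->length / 4 || hdr->tpds > hdr->length / 4)
		return -1;

	need = sizeof(*hdr) +
	       (uint64_t)hdr->ipds * sizeof(uint32_t) +
	       (uint64_t)hdr->tpds * sizeof(uint32_t) +
	       (uint64_t)hdr->ipds * hdr->tpds * sizeof(uint16_t);
	if (hdr->length < need)
		return -1;

	/* Plain pointer arithmetic: each array starts where the last ended. */
	v->inits = (const uint32_t *)(hdr + 1);
	v->targs = v->inits + hdr->ipds;
	v->entries = (const uint16_t *)(v->targs + hdr->tpds);
	return 0;
}
```

Written this way, no expression indexes an array at its one-past-the-end
element, which is the readability point being made.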

> +	for (init = 0; init < ipds; init++) {
> +		for (targ = 0; targ < tpds; targ++) {
> +			value = entries[init * tpds + targ];
> +			value = (value * hmat_loc->entry_base_unit) / 10;
> +			pr_info("  Initiator-Target[%d-%d]:%d%s\n",
> +				inits[init], targs[targ], value,
> +				hmat_data_type_suffix(type));

Worth checking at this early stage that the domains exist in SRAT,
and screaming if they don't?
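
A rough userspace sketch of such a check. struct srat_model,
srat_has_pxm() and hmat_count_unknown() are hypothetical names; the model
just holds the list of proximity domains that SRAT described. In the kernel,
something like pxm_to_node() returning NUMA_NO_NODE might serve as the
lookup, but that is an assumption here and not something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the set of proximity domains SRAT described. */
struct srat_model {
	const uint32_t *pxms;
	size_t count;
};

static int srat_has_pxm(const struct srat_model *srat, uint32_t pxm)
{
	size_t i;

	for (i = 0; i < srat->count; i++)
		if (srat->pxms[i] == pxm)
			return 1;
	return 0;
}

/* Warn about every listed domain SRAT does not know; return how many. */
static size_t hmat_count_unknown(const struct srat_model *srat,
				 const uint32_t *pds, size_t npds,
				 const char *role)
{
	size_t i, unknown = 0;

	assert(srat && pds && role);
	for (i = 0; i < npds; i++) {
		if (srat_has_pxm(srat, pds[i]))
			continue;
		fprintf(stderr, "HMAT: %s domain %u not described in SRAT\n",
			role, (unsigned int)pds[i]);
		unknown++;
	}
	return unknown;
}
```

A caller could run this once over the initiator list and once over the
target list, and refuse the locality structure if either count is non-zero.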
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static __init int hmat_parse_cache(union acpi_subtable_headers *header,
> +				   const unsigned long end)
> +{
> +	struct acpi_hmat_cache *cache = (void *)header;
> +	u32 attrs;
> +
> +	if (cache->header.length < sizeof(*cache)) {
> +		pr_debug("HMAT: Unexpected cache header length: %d\n",
> +			 cache->header.length);
> +		return -EINVAL;
> +	}
> +
> +	attrs = cache->cache_attributes;
> +	pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n",
> +		cache->memory_PD, cache->cache_size, attrs,
> +		cache->number_of_SMBIOShandles);

Can we sanity check that those SMBIOS handles actually match anything?

> +
> +	return 0;
> +}
> +
> +static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
> +					   const unsigned long end)
> +{
> +	struct acpi_hmat_address_range *spa = (void *)header;
> +
> +	if (spa->header.length != sizeof(*spa)) {
> +		pr_debug("HMAT: Unexpected address range header length: %d\n",
> +			 spa->header.length);

My gut feeling is that it's much more useful to make this always print rather
than debug.  Same with other error paths above.  Given the number of times
broken ACPI tables show up, it's nice to complain really loudly!

Perhaps others prefer to not do so though so I'll defer to subsystem norms.

> +		return -EINVAL;
> +	}
> +	pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
> +		spa->physical_address_base, spa->physical_address_length,
> +		spa->flags, spa->processor_PD, spa->memory_PD);
> +
> +	return 0;
> +}
> +
> +static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
> +				      const unsigned long end)
> +{
> +	struct acpi_hmat_structure *hdr = (void *)header;
> +
> +	if (!hdr)
> +		return -EINVAL;
> +
> +	switch (hdr->type) {
> +	case ACPI_HMAT_TYPE_ADDRESS_RANGE:
> +		return hmat_parse_address_range(header, end);
> +	case ACPI_HMAT_TYPE_LOCALITY:
> +		return hmat_parse_locality(header, end);
> +	case ACPI_HMAT_TYPE_CACHE:
> +		return hmat_parse_cache(header, end);
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +static __init int hmat_init(void)
> +{
> +	struct acpi_table_header *tbl;
> +	enum acpi_hmat_type i;
> +	acpi_status status;
> +
> +	if (srat_disabled())
> +		return 0;
> +
> +	status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
> +	if (ACPI_FAILURE(status))
> +		return 0;
> +
> +	for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) {
> +		if (acpi_table_parse_entries(ACPI_SIG_HMAT,
> +					     sizeof(struct acpi_table_hmat), i,
> +					     hmat_parse_subtable, 0) < 0)
> +			goto out_put;
> +	}
> +out_put:
> +	acpi_put_table(tbl);
> +	return 0;
> +}
> +subsys_initcall(hmat_init);

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory
@ 2019-02-06 12:28     ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 12:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, 24 Jan 2019 16:07:17 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Systems may provide different memory types and export this information
> in the ACPI Heterogeneous Memory Attribute Table (HMAT). Parse these
> tables provided by the platform and report the memory access and caching
> attributes to the kernel messages.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
Minor comments inline.

One question for reviewers in general: should this be a lot 'louder' on
failures?

I'd really like the kernel to moan a lot on all occasions if we start getting
bad HMAT tables out there.  This feels to me too silent by default!

Jonathan
> ---
>  drivers/acpi/Kconfig       |   1 +
>  drivers/acpi/Makefile      |   1 +
>  drivers/acpi/hmat/Kconfig  |   8 ++
>  drivers/acpi/hmat/Makefile |   1 +
>  drivers/acpi/hmat/hmat.c   | 181 +++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 192 insertions(+)
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
> 
> diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> index 90ff0a47c12e..b377f970adfd 100644
> --- a/drivers/acpi/Kconfig
> +++ b/drivers/acpi/Kconfig
> @@ -465,6 +465,7 @@ config ACPI_REDUCED_HARDWARE_ONLY
>  	  If you are unsure what to do, do not enable this option.
>  
>  source "drivers/acpi/nfit/Kconfig"
> +source "drivers/acpi/hmat/Kconfig"
>  
>  source "drivers/acpi/apei/Kconfig"
>  source "drivers/acpi/dptf/Kconfig"
> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> index bb857421c2e8..5d361e4e3405 100644
> --- a/drivers/acpi/Makefile
> +++ b/drivers/acpi/Makefile
> @@ -80,6 +80,7 @@ obj-$(CONFIG_ACPI_PROCESSOR)	+= processor.o
>  obj-$(CONFIG_ACPI)		+= container.o
>  obj-$(CONFIG_ACPI_THERMAL)	+= thermal.o
>  obj-$(CONFIG_ACPI_NFIT)		+= nfit/
> +obj-$(CONFIG_ACPI_HMAT)		+= hmat/
>  obj-$(CONFIG_ACPI)		+= acpi_memhotplug.o
>  obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
>  obj-$(CONFIG_ACPI_BATTERY)	+= battery.o
> diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
> new file mode 100644
> index 000000000000..c9637e2e7514
> --- /dev/null
> +++ b/drivers/acpi/hmat/Kconfig
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: GPL-2.0
> +config ACPI_HMAT
> +	bool "ACPI Heterogeneous Memory Attribute Table Support"
> +	depends on ACPI_NUMA
> +	help
> +	 If set, this option causes the kernel to set the memory NUMA node
> +	 relationships and access attributes in accordance with ACPI HMAT
> +	 (Heterogeneous Memory Attributes Table).
> diff --git a/drivers/acpi/hmat/Makefile b/drivers/acpi/hmat/Makefile
> new file mode 100644
> index 000000000000..e909051d3d00
> --- /dev/null
> +++ b/drivers/acpi/hmat/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_ACPI_HMAT) := hmat.o
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> new file mode 100644
> index 000000000000..1741bf30d87f
> --- /dev/null
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -0,0 +1,181 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2019, Intel Corporation.
> + *
> + * Heterogeneous Memory Attributes Table (HMAT) representation
> + *
> + * This program parses and reports the platform's HMAT tables, and registers
> + * the applicable attributes with the node's interfaces.
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/bitops.h>
> +#include <linux/device.h>
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/node.h>
> +#include <linux/sysfs.h>
> +
> +static __init const char *hmat_data_type(u8 type)
> +{
> +	switch (type) {
> +	case ACPI_HMAT_ACCESS_LATENCY:
> +		return "Access Latency";
> +	case ACPI_HMAT_READ_LATENCY:
> +		return "Read Latency";
> +	case ACPI_HMAT_WRITE_LATENCY:
> +		return "Write Latency";
> +	case ACPI_HMAT_ACCESS_BANDWIDTH:
> +		return "Access Bandwidth";
> +	case ACPI_HMAT_READ_BANDWIDTH:
> +		return "Read Bandwidth";
> +	case ACPI_HMAT_WRITE_BANDWIDTH:
> +		return "Write Bandwidth";
> +	default:
> +		return "Reserved";
> +	};
> +}
> +
> +static __init const char *hmat_data_type_suffix(u8 type)
> +{
> +	switch (type) {
> +	case ACPI_HMAT_ACCESS_LATENCY:
> +	case ACPI_HMAT_READ_LATENCY:
> +	case ACPI_HMAT_WRITE_LATENCY:
> +		return " nsec";
> +	case ACPI_HMAT_ACCESS_BANDWIDTH:
> +	case ACPI_HMAT_READ_BANDWIDTH:
> +	case ACPI_HMAT_WRITE_BANDWIDTH:
> +		return " MB/s";
> +	default:
> +		return "";
> +	};
> +}
> +
> +static __init int hmat_parse_locality(union acpi_subtable_headers *header,
> +				      const unsigned long end)
> +{
> +	struct acpi_hmat_locality *hmat_loc = (void *)header;
> +	unsigned int init, targ, total_size, ipds, tpds;
> +	u32 *inits, *targs, value;
> +	u16 *entries;
> +	u8 type;
> +
> +	if (hmat_loc->header.length < sizeof(*hmat_loc)) {
> +		pr_debug("HMAT: Unexpected locality header length: %d\n",
> +			 hmat_loc->header.length);
> +		return -EINVAL;
> +	}
> +
> +	type = hmat_loc->data_type;
> +	ipds = hmat_loc->number_of_initiator_Pds;
> +	tpds = hmat_loc->number_of_target_Pds;
> +	total_size = sizeof(*hmat_loc) + sizeof(*entries) * ipds * tpds +
> +		     sizeof(*inits) * ipds + sizeof(*targs) * tpds;
> +	if (hmat_loc->header.length < total_size) {
> +		pr_debug("HMAT: Unexpected locality header length:%d, minimum required:%d\n",
> +			 hmat_loc->header.length, total_size);
> +		return -EINVAL;
> +	}
> +
> +	pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n",
> +		hmat_loc->flags, hmat_data_type(type), ipds, tpds,
> +		hmat_loc->entry_base_unit);
> +
> +	inits = (u32 *)(hmat_loc + 1);
> +	targs = &inits[ipds];
This line is a bit of an oddity as it's indexing off the end of the data.
	targs = inits + ipds;
would be nicer to my mind as it doesn't even hint that we are still in inits.


> +	entries = (u16 *)(&targs[tpds]);

As above I'd prefer we did the pointer arithmetic explicitly rather
than used an index off the end of the array.
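
To make the layout concrete: the structure is a fixed header, then ipds
u32 initiator domains, then tpds u32 target domains, then ipds * tpds
u16 entries. Below is a minimal userspace sketch of the explicit
pointer-arithmetic form. struct loc_hdr, struct loc_view and
loc_layout() are hypothetical stand-ins for illustration, not the
ACPICA definitions or the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed part of the locality structure. */
struct loc_hdr {
	uint16_t type;
	uint16_t reserved;
	uint32_t length;	/* total length of the structure in bytes */
	uint32_t ipds;		/* number of initiator proximity domains */
	uint32_t tpds;		/* number of target proximity domains */
};

/* The three variable-length arrays that follow the fixed header. */
struct loc_view {
	const uint32_t *inits;
	const uint32_t *targs;
	const uint16_t *entries;
};

/* Fills *view and returns 0, or returns -1 if hdr->length is too short. */
int loc_layout(const struct loc_hdr *hdr, struct loc_view *view)
{
	size_t need;

	assert(hdr && view);
	need = sizeof(*hdr) +
	       sizeof(uint32_t) * ((size_t)hdr->ipds + hdr->tpds) +
	       sizeof(uint16_t) * (size_t)hdr->ipds * hdr->tpds;
	if (hdr->length < need)
		return -1;

	/* Each array starts where the previous one ends: plain addition. */
	view->inits = (const uint32_t *)(hdr + 1);
	view->targs = view->inits + hdr->ipds;
	view->entries = (const uint16_t *)(view->targs + hdr->tpds);
	return 0;
}
```

Written this way, no expression reads as an access to an element of
inits or targs past its last valid index.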

> +	for (init = 0; init < ipds; init++) {
> +		for (targ = 0; targ < tpds; targ++) {
> +			value = entries[init * tpds + targ];
> +			value = (value * hmat_loc->entry_base_unit) / 10;
> +			pr_info("  Initiator-Target[%d-%d]:%d%s\n",
> +				inits[init], targs[targ], value,
> +				hmat_data_type_suffix(type));

Worth checking at this early stage that the domains exist in SRAT,
and screaming if they don't?
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static __init int hmat_parse_cache(union acpi_subtable_headers *header,
> +				   const unsigned long end)
> +{
> +	struct acpi_hmat_cache *cache = (void *)header;
> +	u32 attrs;
> +
> +	if (cache->header.length < sizeof(*cache)) {
> +		pr_debug("HMAT: Unexpected cache header length: %d\n",
> +			 cache->header.length);
> +		return -EINVAL;
> +	}
> +
> +	attrs = cache->cache_attributes;
> +	pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n",
> +		cache->memory_PD, cache->cache_size, attrs,
> +		cache->number_of_SMBIOShandles);

Can we sanity check those smbios handles actually match anything?

> +
> +	return 0;
> +}
> +
> +static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
> +					   const unsigned long end)
> +{
> +	struct acpi_hmat_address_range *spa = (void *)header;
> +
> +	if (spa->header.length != sizeof(*spa)) {
> +		pr_debug("HMAT: Unexpected address range header length: %d\n",
> +			 spa->header.length);

My gut feeling is that it's much more useful to make this always print rather
than debug.  Same with other error paths above.  Given the number of times
broken ACPI tables show up, it's nice to complain really loudly!

Perhaps others prefer to not do so though so I'll defer to subsystem norms.

> +		return -EINVAL;
> +	}
> +	pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
> +		spa->physical_address_base, spa->physical_address_length,
> +		spa->flags, spa->processor_PD, spa->memory_PD);
> +
> +	return 0;
> +}
> +
> +static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
> +				      const unsigned long end)
> +{
> +	struct acpi_hmat_structure *hdr = (void *)header;
> +
> +	if (!hdr)
> +		return -EINVAL;
> +
> +	switch (hdr->type) {
> +	case ACPI_HMAT_TYPE_ADDRESS_RANGE:
> +		return hmat_parse_address_range(header, end);
> +	case ACPI_HMAT_TYPE_LOCALITY:
> +		return hmat_parse_locality(header, end);
> +	case ACPI_HMAT_TYPE_CACHE:
> +		return hmat_parse_cache(header, end);
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +static __init int hmat_init(void)
> +{
> +	struct acpi_table_header *tbl;
> +	enum acpi_hmat_type i;
> +	acpi_status status;
> +
> +	if (srat_disabled())
> +		return 0;
> +
> +	status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
> +	if (ACPI_FAILURE(status))
> +		return 0;
> +
> +	for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) {
> +		if (acpi_table_parse_entries(ACPI_SIG_HMAT,
> +					     sizeof(struct acpi_table_hmat), i,
> +					     hmat_parse_subtable, 0) < 0)
> +			goto out_put;
> +	}
> +out_put:
> +	acpi_put_table(tbl);
> +	return 0;
> +}
> +subsys_initcall(hmat_init);



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 00/10] Heterogeneuos memory node attributes
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
@ 2019-02-06 12:31   ` Jonathan Cameron
  2019-01-24 23:07 ` [PATCHv5 02/10] acpi: Add HMAT to generic parsing tables Keith Busch
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 12:31 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams, linuxarm

On Thu, 24 Jan 2019 16:07:14 -0700
Keith Busch <keith.busch@intel.com> wrote:

> == Changes since v4 ==
> 
>   All public interfaces have kernel docs.
> 
>   Renamed "class" to "access", docs and changed logs updated
>   accordingly. (Rafael)
> 
>   The sysfs hierarchy is altered to put initiators and targets in their
>   own attribute group directories (Rafael).
> 
>   The node lists are removed. This feedback is in conflict with v1
>   feedback, but consensus wants to remove multi-value sysfs attributes,
>   which includes lists. We only have symlinks now, just like v1 provided.
> 
>   Documentation and code patches are combined such that the code
>   introducing new attributes and its documentation are in the same
>   patch. (Rafael and Dan).
> 
>   The performance attributes, bandwidth and latency, are moved into the
>   initiators directory. This should make it obvious for which node
>   access the attributes apply, which was previously ambiguous.
>   (Jonathan Cameron).
> 
>   The HMAT code selecting "local" initiators is substantially changed.
>   Only PXM's that have identical performance to the HMAT's processor PXM
>   in Address Range Structure are registered. This is to avoid considering
>   nodes identical when only one of several perf attributes are the same.
>   (Jonathan Cameron).
> 
>   Verbose variable naming. Examples include "initiator" and "target"
>   instead of "i" and "t", "mem_pxm" and "cpu_pxm" instead of "m" and
>   "p". (Rafael)
> 
>   Compile fixes for when HMEM_REPORTING is not set. This is not a user
>   selectable config option, default 'n', and will have to be selected
>   by other config options that require it (Greg KH and Rafael).
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and are in use today that describe
> the more complex memory hierarchies that can be created.
> 
> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.

Hi Keith,

Seems to be heading in the right direction to me... (though I personally
want to see the whole of HMAT exposed, but meh, that seems unpopular :)

I've fired up a new test rig (someone pinched the fan on the previous one)
that I can make present pretty much anything to this code.

First up is a system with 4 nodes with cpu and local ddr [0-3] + 1 remote node
with just memory [4]. All the figures as you might expect between the nodes with
CPUs. The remote node has equal numbers from all the cpus.

First some general comments on places this doesn't work as my gut feeling
said it would...

I'm going to keep this somewhat vague on certain points as ACPI 6.3 should
be public any day now and I think it is fair to say we should take into
account any changes in there...
There is definitely one place the current patches won't work with 6.3, but
I'll point it out in a few days.  There may be others.

1) It seems this version added a hard dependence on having the memory node
   listed in the Memory Proximity Domain attribute structures.  I'm not 100%
   sure there is actually any requirement to have those structures. If you aren't
   using the hint bit, they don't convey any information.  It could be argued
   that they provide info on what is found in the other hmat entries, but there
   is little purpose as those entries are explicit in what they provide.
   (Given I didn't have any of these structures and things worked fine with
    v4 it seems this is a new check).

   This is also somewhat inconsistent.
   a) If a given entry isn't there, we still get for example
      node4/access0/initiators/[read|write]_* but all values are 0.
      If we want to do the check you have it needs to not create the files in
      this case.  Whilst they have no meaning as there are no initiators, it
      is inconsistent to my mind.

   b) Having one "Memory Proximity Domain attribute structure" for node 4 linking
      it to node0 is sufficient to allow
      node4/access0/initiators/node0
      node4/access0/initiators/node1
      node4/access0/initiators/node2
      node4/access0/initiators/node3
      I think if we are going to enforce the presence of that structure then only
      the node0 link should exist.

2) Error handling could perhaps do with spitting out some nasty warnings.
   If we have an entry for nodes that don't exist we shouldn't just fail silently;
   that's just one example I managed to trigger with minor table tweaking.

Personally I would just get rid of enforcing anything based on the presence of
that structure.

I'll send more focused comments on some of the individual patches.

Thanks,

Jonathan
   

> 
> Keith Busch (10):
>   acpi: Create subtable parsing infrastructure
>   acpi: Add HMAT to generic parsing tables
>   acpi/hmat: Parse and report heterogeneous memory
>   node: Link memory nodes to their compute nodes
>   acpi/hmat: Register processor domain to its memory
>   node: Add heterogenous memory access attributes
>   acpi/hmat: Register performance attributes
>   node: Add memory caching attributes
>   acpi/hmat: Register memory side cache attributes
>   doc/mm: New documentation for memory performance
> 
>  Documentation/ABI/stable/sysfs-devices-node   |  87 ++++-
>  Documentation/admin-guide/mm/numaperf.rst     | 167 ++++++++
>  arch/arm64/kernel/acpi_numa.c                 |   2 +-
>  arch/arm64/kernel/smp.c                       |   4 +-
>  arch/ia64/kernel/acpi.c                       |  12 +-
>  arch/x86/kernel/acpi/boot.c                   |  36 +-
>  drivers/acpi/Kconfig                          |   1 +
>  drivers/acpi/Makefile                         |   1 +
>  drivers/acpi/hmat/Kconfig                     |   9 +
>  drivers/acpi/hmat/Makefile                    |   1 +
>  drivers/acpi/hmat/hmat.c                      | 537 ++++++++++++++++++++++++++
>  drivers/acpi/numa.c                           |  16 +-
>  drivers/acpi/scan.c                           |   4 +-
>  drivers/acpi/tables.c                         |  76 +++-
>  drivers/base/Kconfig                          |   8 +
>  drivers/base/node.c                           | 354 ++++++++++++++++-
>  drivers/irqchip/irq-gic-v2m.c                 |   2 +-
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
>  drivers/irqchip/irq-gic-v3-its.c              |   6 +-
>  drivers/irqchip/irq-gic-v3.c                  |  10 +-
>  drivers/irqchip/irq-gic.c                     |   4 +-
>  drivers/mailbox/pcc.c                         |   2 +-
>  include/linux/acpi.h                          |   6 +-
>  include/linux/node.h                          |  60 ++-
>  25 files changed, 1344 insertions(+), 65 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory
  2019-02-06 12:28     ` Jonathan Cameron
  (?)
@ 2019-02-06 16:06     ` Keith Busch
  2019-02-06 16:39         ` Jonathan Cameron
  -1 siblings, 1 reply; 53+ messages in thread
From: Keith Busch @ 2019-02-06 16:06 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Wed, Feb 06, 2019 at 12:28:14PM +0000, Jonathan Cameron wrote:
> On Thu, 24 Jan 2019 16:07:17 -0700
> Keith Busch <keith.busch@intel.com> wrote:
> 
> > +	pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n",
> > +		hmat_loc->flags, hmat_data_type(type), ipds, tpds,
> > +		hmat_loc->entry_base_unit);
> > +
> > +	inits = (u32 *)(hmat_loc + 1);
> > +	targs = &inits[ipds];
>
> This line is a bit of an oddity as it's indexing off the end of the data.
> 	targs = inits + ipds;
> would be nicer to my mind as doesn't even hint that we are in inits still.
> 
> 
> > +	entries = (u16 *)(&targs[tpds]);

Sure, I can change these to addition rather than indexing. I have no
preference either way.

> As above I'd prefer we did the pointer arithmetic explicitly rather
> than used an index off the end of the array.
> 
> > +	for (init = 0; init < ipds; init++) {
> > +		for (targ = 0; targ < tpds; targ++) {
> > +			value = entries[init * tpds + targ];
> > +			value = (value * hmat_loc->entry_base_unit) / 10;
> > +			pr_info("  Initiator-Target[%d-%d]:%d%s\n",
> > +				inits[init], targs[targ], value,
> > +				hmat_data_type_suffix(type));
> 
> Worth checking at this early stage that the domains exist in SRAT?
> + screaming if they don't.

Sure, I think it should be sufficient to check pxm_to_node() for a valid
value to validate the table is okay.
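
A rough userspace sketch of that kind of check follows. It assumes an
unmapped proximity domain is reported as NUMA_NO_NODE.
mock_pxm_to_node() and its table are stand-ins for the real
pxm_to_node() and the SRAT-derived mapping, and
count_unknown_domains() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NUMA_NO_NODE	(-1)
#define MAX_PXM		8

/* Mock PXM-to-node map standing in for what SRAT parsing would build. */
static const int mock_pxm_map[MAX_PXM] = {
	0, 1, 2, 3, 4, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE
};

/* Mock lookup: NUMA_NO_NODE when the domain was never listed. */
int mock_pxm_to_node(uint32_t pxm)
{
	return pxm < MAX_PXM ? mock_pxm_map[pxm] : NUMA_NO_NODE;
}

/* Complains about every domain without a node and returns how many. */
size_t count_unknown_domains(const char *what, const uint32_t *pds, size_t n)
{
	size_t i, bad = 0;

	assert(pds || n == 0);
	for (i = 0; i < n; i++) {
		if (mock_pxm_to_node(pds[i]) == NUMA_NO_NODE) {
			fprintf(stderr, "HMAT: %s domain %u not present in SRAT\n",
				what, (unsigned int)pds[i]);
			bad++;
		}
	}
	return bad;
}
```

A parser could run this over the initiator and target arrays before
printing any entries, and reject the structure when the count is
non-zero.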

> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static __init int hmat_parse_cache(union acpi_subtable_headers *header,
> > +				   const unsigned long end)
> > +{
> > +	struct acpi_hmat_cache *cache = (void *)header;
> > +	u32 attrs;
> > +
> > +	if (cache->header.length < sizeof(*cache)) {
> > +		pr_debug("HMAT: Unexpected cache header length: %d\n",
> > +			 cache->header.length);
> > +		return -EINVAL;
> > +	}
> > +
> > +	attrs = cache->cache_attributes;
> > +	pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n",
> > +		cache->memory_PD, cache->cache_size, attrs,
> > +		cache->number_of_SMBIOShandles);
> 
> Can we sanity check those smbios handles actually match anything?

Will do.
 
> > +
> > +	return 0;
> > +}
> > +
> > +static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
> > +					   const unsigned long end)
> > +{
> > +	struct acpi_hmat_address_range *spa = (void *)header;
> > +
> > +	if (spa->header.length != sizeof(*spa)) {
> > +		pr_debug("HMAT: Unexpected address range header length: %d\n",
> > +			 spa->header.length);
> 
> My gut feeling is that it's much more useful to make this always print rather
> than debug.  Same with other error paths above.  Given the number of times
> broken ACPI tables show up, it's nice to complain really loudly!
> 
> Perhaps others prefer to not do so though so I'll defer to subsystem norms.

Yeah, I demoted these to debug based on earlier feedback. We should
still be operational even with broken HMAT, so I don't want to create
unnecessary panic if it's broken, but I agree something should be
immediately noticeable if the firmware tables are incorrect. Maybe like
what bad_srat() provides.
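
As a sketch of the pattern meant here: complain once, loudly, then stop
using the table while the system keeps running. This is a userspace
illustration only. bad_hmat(), hmat_usable() and hmat_complaint_count()
are hypothetical names, and the behaviour is modelled loosely on the
idea above rather than copied from bad_srat().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool hmat_disable;		/* once set, the table is ignored */
static unsigned int hmat_complaints;	/* number of messages printed */

/* Hypothetical helper: report the first problem only, then disable. */
void bad_hmat(const char *reason)
{
	assert(reason);
	if (!hmat_disable) {
		fprintf(stderr, "HMAT: %s; table will not be used\n", reason);
		hmat_complaints++;
	}
	hmat_disable = true;
}

/* Callers check this before registering any attributes. */
bool hmat_usable(void)
{
	return !hmat_disable;
}

unsigned int hmat_complaint_count(void)
{
	return hmat_complaints;
}
```

The individual length checks could then stay at debug level while one
always-visible line tells the user the table was rejected.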

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-02-06 12:26     ` Jonathan Cameron
  (?)
@ 2019-02-06 16:12     ` Keith Busch
  2019-02-06 16:47       ` Jonathan Cameron
  -1 siblings, 1 reply; 53+ messages in thread
From: Keith Busch @ 2019-02-06 16:12 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Hansen, Dave, Williams, Dan J

On Wed, Feb 06, 2019 at 04:26:35AM -0800, Jonathan Cameron wrote:
> On Thu, 24 Jan 2019 16:07:18 -0700
> Keith Busch <keith.busch@intel.com> wrote:
> > +What:		/sys/devices/system/node/nodeX/accessY/initiators/
> > +Date:		December 2018
> > +Contact:	Keith Busch <keith.busch@intel.com>
> > +Description:
> > +		The directory containing symlinks to memory initiator
> > +		nodes that have class "Y" access to this target node's
> > +		memory. CPUs and other memory initiators in nodes not in
> > +		the list accessing this node's memory may have different
> > +		performance.
> 
> Also seems to contain the characteristics of those accesses.

Right, but not in this patch. I will append the description for access
characteristics in the patch that adds those attributes.
 
> > + * 	This function will export node relationships linking which memory
> > + * 	initiator nodes can access memory targets at a given ranked access
> > + * 	class.
> > + */
> > +int register_memory_node_under_compute_node(unsigned int mem_nid,
> > +					    unsigned int cpu_nid,
> > +					    unsigned access)
> > +{
> > +	struct node *init_node, *targ_node;
> > +	struct node_access_nodes *initiator, *target;
> > +	int ret;
> > +
> > +	if (!node_online(cpu_nid) || !node_online(mem_nid))
> > +		return -ENODEV;
> 
> What do we do under memory/node hotplug?  More than likely that will
> apply in such systems (it does in mine for starters).
> Clearly to do the full story we would need _HMT support etc but
> we can do the prebaked version by having hmat entries for nodes
> that aren't online yet (like we do for SRAT).
> 
> Perhaps one for a follow up patch set.  However, I'd like a
> pr_info to indicate that the node is listed but not online currently.

Yes, hot plug is planned to follow on to this series.

> > +
> > +	init_node = node_devices[cpu_nid];
> > +	targ_node = node_devices[mem_nid];
> > +	initiator = node_init_node_access(init_node, access);
> > +	target = node_init_node_access(targ_node, access);
> > +	if (!initiator || !target)
> > +		return -ENOMEM;
>
> If one of these fails and the other doesn't + the one that succeeded
> did an init, don't we end up leaking a device here?  I'd expect this
> function to not leave things hanging if it has an error. It should
> unwind anything it has done.  It has been added to the list so
> could be cleaned up later, but I'm not seeing that code. 
> 
> These only get cleaned up when the node is removed.

The initiator-target relationship is many-to-many, so we don't want to
free it just because we couldn't allocate its pairing node. The
existing one may still be paired to others we were able to allocate.
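
A simplified userspace model of that reasoning, assuming each node
lazily creates one object per access class and shares it across all of
its pairings. struct node_state, struct access_obj, get_or_create(),
link_nodes() and the fail_next_alloc hook are hypothetical and only
illustrate why the error path does not unwind.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MAX_ACCESS 4

/* One object per (node, access class), shared by all of its pairings. */
struct access_obj {
	unsigned int links;	/* pairings that refer to this object */
};

struct node_state {
	struct access_obj *access[MAX_ACCESS];
};

int fail_next_alloc;	/* test hook: make the next allocation fail */

/* Returns the existing object for this class, or allocates a new one. */
struct access_obj *get_or_create(struct node_state *node, unsigned int access)
{
	assert(node && access < MAX_ACCESS);
	if (node->access[access])
		return node->access[access];
	if (fail_next_alloc) {
		fail_next_alloc = 0;
		return NULL;
	}
	node->access[access] = calloc(1, sizeof(struct access_obj));
	return node->access[access];
}

int link_nodes(struct node_state *init, struct node_state *targ,
	       unsigned int access)
{
	struct access_obj *i = get_or_create(init, access);
	struct access_obj *t = get_or_create(targ, access);

	/* No unwind: the object that exists may already back other links. */
	if (!i || !t)
		return -ENOMEM;
	i->links++;
	t->links++;
	return 0;
}
```

When the second pairing fails, the initiator's object from the first
pairing is still in use, so freeing it on the error path would break
the link that succeeded.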

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 10/10] doc/mm: New documentation for memory performance
  2019-02-06 10:45     ` Jonathan Cameron
  (?)
@ 2019-02-06 16:25     ` Keith Busch
  -1 siblings, 0 replies; 53+ messages in thread
From: Keith Busch @ 2019-02-06 16:25 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams, linuxarm

On Wed, Feb 06, 2019 at 10:45:52AM +0000, Jonathan Cameron wrote:
> On Thu, 24 Jan 2019 16:07:24 -0700
> Keith Busch <keith.busch@intel.com> wrote:
> > +	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/
> > +	/sys/devices/system/node/nodeY/access0/
> > +	|-- read_bandwidth
> > +	|-- read_latency
> > +	|-- write_bandwidth
> > +	`-- write_latency
> 
> These seem to be under
> /sys/devices/system/node/nodeY/access0/initiators/
> (so one directory deeper).

You're right, I used data from the previous series to generate that.
 
> > +	# tree sys/devices/system/node/node0/side_cache/
> > +	/sys/devices/system/node/node0/side_cache/
> > +	|-- index1
> > +	|   |-- associativity
> > +	|   |-- level
> 
> What is the purpose of having level in here?  Isn't it the same as the A..C
> in the index naming?

Yes, it is redundant with the name. I will remove it.
 
> > +	|   |-- line_size
> > +	|   |-- size
> > +	|   `-- write_policy
> > +
> > +The "associativity" will be 0 if it is a direct-mapped cache, and non-zero
> > +for any other indexed based, multi-way associativity.
> 
> Is it worth providing the ACPI mapping in this doc?  We have None, Direct and
> 'complex'.   Fun question of what None means?  Not specified?

Yeah, my take on "none" was that it's unreported and we don't know what
is actually happening.
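
For reference, a small sketch of decoding the three ACPI values
mentioned above. It assumes the associativity lives in bits 11:8 of
the cache attributes field with 0 = none, 1 = direct mapped and
2 = complex cache indexing, which should be checked against the spec.
The enum and function names are made up for illustration, and these
are the table's values, not the numbers the sysfs "associativity"
file reports.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding: associativity in bits 11:8 of the attributes. */
#define CACHE_ASSOC_MASK	0x00000F00u
#define CACHE_ASSOC_SHIFT	8

enum cache_assoc {
	ASSOC_NONE,	/* treated here as "unreported" */
	ASSOC_DIRECT,
	ASSOC_COMPLEX,
	ASSOC_RESERVED,
};

/* Extracts the associativity field; unknown values map to reserved. */
enum cache_assoc cache_assoc(uint32_t attrs)
{
	switch ((attrs & CACHE_ASSOC_MASK) >> CACHE_ASSOC_SHIFT) {
	case 0:
		return ASSOC_NONE;
	case 1:
		return ASSOC_DIRECT;
	case 2:
		return ASSOC_COMPLEX;
	default:
		return ASSOC_RESERVED;
	}
}

/* Human-readable name for documentation or log output. */
const char *cache_assoc_name(enum cache_assoc a)
{
	static const char *const names[] = {
		"none (unreported)", "direct mapped",
		"complex cache indexing", "reserved",
	};

	assert(a <= ASSOC_RESERVED);
	return names[a];
}
```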

> > +
> > +The "level" is the distance from the far memory, and matches the number
> > +appended to its "index" directory.
> > +
> > +The "line_size" is the number of bytes accessed on a cache miss.
> 
> Maybe "number of bytes accessed from next cache level" ?

Sounds good.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory
  2019-02-06 16:06     ` Keith Busch
@ 2019-02-06 16:39         ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 16:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams


...

> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
> > > +					   const unsigned long end)
> > > +{
> > > +	struct acpi_hmat_address_range *spa = (void *)header;
> > > +
> > > +	if (spa->header.length != sizeof(*spa)) {
> > > +		pr_debug("HMAT: Unexpected address range header length: %d\n",
> > > +			 spa->header.length);  
> > 
> > My gut feeling is that it's much more useful to make this always print rather
> > than debug.  Same with other error paths above.  Given the number of times
> > broken ACPI tables show up, it's nice to complain really loudly!
> > 
> > Perhaps others prefer to not do so though so I'll defer to subsystem norms.  
> 
> Yeah, I demoted these to debug based on earlier feedback. We should
> still be operational even with broken HMAT, so I don't want to create
> unnecessary panic if it's broken, but I agree something should be
> immediately noticeable if the firmware tables are incorrect. Maybe like
> what bad_srat() provides.

Agreed. Something general like that would be great. It lets people know they
should turn debug on.

Thanks,

Jonathan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-02-06 16:12     ` Keith Busch
@ 2019-02-06 16:47       ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 16:47 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Hansen, Dave, Williams, Dan J

On Wed, 6 Feb 2019 09:12:27 -0700
Keith Busch <keith.busch@intel.com> wrote:

> On Wed, Feb 06, 2019 at 04:26:35AM -0800, Jonathan Cameron wrote:
> > On Thu, 24 Jan 2019 16:07:18 -0700
> > Keith Busch <keith.busch@intel.com> wrote:  
> > > +What:		/sys/devices/system/node/nodeX/accessY/initiators/
> > > +Date:		December 2018
> > > +Contact:	Keith Busch <keith.busch@intel.com>
> > > +Description:
> > > +		The directory containing symlinks to memory initiator
> > > +		nodes that have class "Y" access to this target node's
> > > +		memory. CPUs and other memory initiators in nodes not in
> > > +		the list accessing this node's memory may have different
> > > +		performance.  
> > 
> > Also seems to contain the characteristics of those accesses.  
> 
> Right, but not in this patch. I will append the description for access
> characteristics in the patch that adds those attributes.
>  
> > > + * 	This function will export node relationships linking which memory
> > > + * 	initiator nodes can access memory targets at a given ranked access
> > > + * 	class.
> > > + */
> > > +int register_memory_node_under_compute_node(unsigned int mem_nid,
> > > +					    unsigned int cpu_nid,
> > > +					    unsigned access)
> > > +{
> > > +	struct node *init_node, *targ_node;
> > > +	struct node_access_nodes *initiator, *target;
> > > +	int ret;
> > > +
> > > +	if (!node_online(cpu_nid) || !node_online(mem_nid))
> > > +		return -ENODEV;  
> > 
> > What do we do under memory/node hotplug?  More than likely that will
> > apply in such systems (it does in mine for starters).
> > Clearly to do the full story we would need _HMT support etc but
> > we can do the prebaked version by having hmat entries for nodes
> > that aren't online yet (like we do for SRAT).
> > 
> > Perhaps one for a follow up patch set.  However, I'd like a
> > pr_info to indicate that the node is listed but not online currently.
> 
> Yes, hot plug is planned to follow on to this series.
> 
> > > +
> > > +	init_node = node_devices[cpu_nid];
> > > +	targ_node = node_devices[mem_nid];
> > > +	initiator = node_init_node_access(init_node, access);
> > > +	target = node_init_node_access(targ_node, access);
> > > +	if (!initiator || !target)
> > > +		return -ENOMEM;  
> >
> > If one of these fails and the other doesn't + the one that succeeded
> > did an init, don't we end up leaking a device here?  I'd expect this
> > function to not leave things hanging if it has an error. It should
> > unwind anything it has done.  It has been added to the list so
> > could be cleaned up later, but I'm not seeing that code. 
> > 
> > These only get cleaned up when the node is removed.  
> 
> The initiator-target relationship is many-to-many, so we don't want to
> free it just because we couldn't allocate its pairing node. The
> existing one may still be paired to others we were able to allocate.

Reference count them?  We have lots of paths that can result in
creation, any of which might need cleaning up. Sounds like a classic
case for reference counts.
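
A rough userspace sketch of what that could look like, assuming one counted
object per node and access class. The struct and helper names are
hypothetical; in the kernel this would more likely build on kref or the
struct device reference count:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-node, per-access-class object. */
struct access_node {
	unsigned int refs;	/* number of initiator/target links using it */
};

/* Take a reference, allocating the object on first use. */
static struct access_node *access_node_get(struct access_node **slot)
{
	assert(slot != NULL);
	if (!*slot) {
		*slot = calloc(1, sizeof(**slot));
		if (!*slot)
			return NULL;
	}
	(*slot)->refs++;
	return *slot;
}

/* Drop a reference, freeing the object when the last link goes away. */
static void access_node_put(struct access_node **slot)
{
	assert(slot != NULL && *slot != NULL && (*slot)->refs > 0);
	if (--(*slot)->refs == 0) {
		free(*slot);
		*slot = NULL;
	}
}

/*
 * Link one initiator to one target. If the second get fails, only the
 * reference taken by this call is dropped, so an object that is still
 * paired with other nodes survives.
 */
static int link_nodes(struct access_node **initiator,
		      struct access_node **target)
{
	if (!access_node_get(initiator))
		return -1;
	if (!access_node_get(target)) {
		access_node_put(initiator);
		return -1;
	}
	return 0;
}
```

With this shape the error path unwinds exactly what it did, and the
many-to-many case is covered because an object shared with other pairings
keeps a non-zero count.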

Jonathan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 00/10] Heterogeneuos memory node attributes
  2019-02-06 12:31   ` Jonathan Cameron
  (?)
@ 2019-02-06 17:19   ` Keith Busch
  2019-02-06 17:30       ` Jonathan Cameron
  -1 siblings, 1 reply; 53+ messages in thread
From: Keith Busch @ 2019-02-06 17:19 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams, linuxarm

On Wed, Feb 06, 2019 at 12:31:00PM +0000, Jonathan Cameron wrote:
> On Thu, 24 Jan 2019 16:07:14 -0700
> Keith Busch <keith.busch@intel.com> wrote:
> 
> 1) It seems this version added a hard dependence on having the memory node
>    listed in the Memory Proximity Domain attribute structures.  I'm not 100%
>    sure there is actually any requirement to have those structures. If you aren't
>    using the hint bit, they don't convey any information.  It could be argued
>    that they provide info on what is found in the other hmat entries, but there
>    is little purpose as those entries are explicit in what they provide.
>    (Given I didn't have any of these structures and things  worked fine with
>     v4 it seems this is a new check).

Right, v4 just used the node(s) with the highest performance. You mentioned
systems having nodes with different performance, but no winner across all
attributes, so there's no clear way to rank these for access class linkage.
Requiring an initiator PXM present clears that up.

Maybe we can fall back to performance if the initiator pxm isn't provided,
but the ranking is going to require an arbitrary decision, like prioritize
latency over bandwidth.
 
>    This is also somewhat inconsistent.
>    a) If a given entry isn't there, we still get for example
>       node4/access0/initiators/[read|write]_* but all values are 0.
>       If we want to do the check you have it needs to not create the files in
>       this case.  Whilst they have no meaning as there are no initiators, it
>       is inconsistent to my mind.
> 
>    b) Having one "Memory Proximity Domain attribute structure" for node 4 linking
>       it to node0 is sufficient to allow
>       node4/access0/initiators/node0
>       node4/access0/initiators/node1
>       node4/access0/initiators/node2
>       node4/access0/initiators/node3
>       I think if we are going to enforce the presence of that structure then only
>       the node0 link should exist.

We'd link the initiator pxm in the Address Range Structure, and also any
other nodes with identical performance access. I think that makes sense.
 
> 2) Error handling could perhaps do with spitting out some nasty warnings.
>    If we have an entry for nodes that don't exist we shouldn't just fail silently,
>    that's just one example I managed to trigger with minor table tweaking.
> 
> Personally I would just get rid of enforcing anything based on the presence of
> that structure.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 00/10] Heterogeneuos memory node attributes
  2019-02-06 17:19   ` Keith Busch
@ 2019-02-06 17:30       ` Jonathan Cameron
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-06 17:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams, linuxarm

On Wed, 6 Feb 2019 10:19:37 -0700
Keith Busch <keith.busch@intel.com> wrote:

> On Wed, Feb 06, 2019 at 12:31:00PM +0000, Jonathan Cameron wrote:
> > On Thu, 24 Jan 2019 16:07:14 -0700
> > Keith Busch <keith.busch@intel.com> wrote:
> > 
> > 1) It seems this version added a hard dependence on having the memory node
> >    listed in the Memory Proximity Domain attribute structures.  I'm not 100%
> >    sure there is actually any requirement to have those structures. If you aren't
> >    using the hint bit, they don't convey any information.  It could be argued
> >    that they provide info on what is found in the other hmat entries, but there
> >    is little purpose as those entries are explicit in what they provide.
> >    (Given I didn't have any of these structures and things  worked fine with
> >     v4 it seems this is a new check).  
> 
> Right, v4 just used the node(s) with the highest performance. You mentioned
> systems having nodes with different performance, but no winner across all
> attributes, so there's no clear way to rank these for access class linkage.
> Requiring an initiator PXM present clears that up.
> 
> Maybe we can fall back to performance if the initiator pxm isn't provided,
> but the ranking is going to require an arbitrary decision, like prioritize
> latency over bandwidth.

I'd certainly prefer to see that fall back and would argue it is
the only valid route.  What is 'best' if we don't put a preference on
one parameter over the other?

Perfectly fine to have another access class that does bandwidth preferred
if that is of sufficient use to people.
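
To make the trade-off concrete, here is a small sketch of a latency-first
ranking, assuming per-initiator numbers are already available. The struct,
field names and units are hypothetical and only illustrate the ordering:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-initiator performance as seen from one memory target. */
struct initiator_perf {
	unsigned int nid;		/* initiator node id */
	unsigned int latency_ns;	/* lower is better */
	unsigned int bandwidth_mbs;	/* higher is better */
};

/*
 * qsort() comparator: order by latency first and use bandwidth only to
 * break ties. Picking latency as the primary key is the arbitrary
 * decision discussed above.
 */
static int cmp_latency_first(const void *a, const void *b)
{
	const struct initiator_perf *x = a, *y = b;

	if (x->latency_ns != y->latency_ns)
		return x->latency_ns < y->latency_ns ? -1 : 1;
	if (x->bandwidth_mbs != y->bandwidth_mbs)
		return x->bandwidth_mbs > y->bandwidth_mbs ? -1 : 1;
	return 0;
}

/* Sort in place; afterwards the best initiator is at index 0. */
static void rank_initiators(struct initiator_perf *perf, size_t count)
{
	assert(perf != NULL && count > 0);
	qsort(perf, count, sizeof(*perf), cmp_latency_first);
}
```

A bandwidth-preferred access class would only need a second comparator
with the two keys swapped.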

>  
> >    This is also somewhat inconsistent.
> >    a) If a given entry isn't there, we still get for example
> >       node4/access0/initiators/[read|write]_* but all values are 0.
> >       If we want to do the check you have it needs to not create the files in
> >       this case.  Whilst they have no meaning as there are no initiators, it
> >       is inconsistent to my mind.
> > 
> >    b) Having one "Memory Proximity Domain attribute structure" for node 4 linking
> >       it to node0 is sufficient to allow
> >       node4/access0/initiators/node0
> >       node4/access0/initiators/node1
> >       node4/access0/initiators/node2
> >       node4/access0/initiators/node3
> >       I think if we are going to enforce the presence of that structure then only
> >       the node0 link should exist.  
> 
> We'd link the initiator pxm in the Address Range Structure, and also any
> other nodes with identical performance access. I think that makes sense.

I disagree on this. It is either/or; it seems really illogical to build
all of them if only one initiator is specified for the target.

If someone deliberately only specified one initiator for this target then they
meant to do that (hopefully).  Probably because they wanted to set one
of the flags.
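
One possible reading of that either/or rule, as a sketch only. NO_INITIATOR,
struct candidate and should_link() are hypothetical names, and latency is
used as the sole performance key to keep the example short:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_INITIATOR (-1)	/* hypothetical: firmware named no initiator */

/* Hypothetical view of one candidate initiator for a memory target. */
struct candidate {
	int nid;
	unsigned int latency_ns;
};

/*
 * Decide whether candidate i gets linked to the target. If firmware named
 * an initiator, only that node is linked. Otherwise every node sharing the
 * lowest latency is linked.
 */
static bool should_link(const struct candidate *c, size_t count, size_t i,
			int explicit_nid)
{
	unsigned int best;
	size_t j;

	assert(c != NULL && i < count);
	if (explicit_nid != NO_INITIATOR)
		return c[i].nid == explicit_nid;

	best = c[0].latency_ns;
	for (j = 1; j < count; j++)
		if (c[j].latency_ns < best)
			best = c[j].latency_ns;
	return c[i].latency_ns == best;
}
```

The point of the split is that the two sources of information never mix:
an explicit initiator is taken at face value, and performance is consulted
only when none is given.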

>  
> > 2) Error handling could perhaps do with spitting out some nasty warnings.
> >    If we have an entry for nodes that don't exist we shouldn't just fail silently,
> >    that's just one example I managed to trigger with minor table tweaking.
> > 
> > Personally I would just get rid of enforcing anything based on the presence of
> > that structure.  

Thanks,

Jonathan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-02-05 15:17         ` Rafael J. Wysocki
  (?)
@ 2019-02-06 23:09         ` Keith Busch
  2019-02-06 23:48             ` Rafael J. Wysocki
  -1 siblings, 1 reply; 53+ messages in thread
From: Keith Busch @ 2019-02-06 23:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg Kroah-Hartman, Linux Kernel Mailing List,
	ACPI Devel Maling List, Linux Memory Management List,
	Dave Hansen, Dan Williams

On Tue, Feb 05, 2019 at 04:17:09PM +0100, Rafael J. Wysocki wrote:
> <gregkh@linuxfoundation.org> wrote:
> >
> > When you use a "raw" kobject then userspace tools do not see the devices
> > and attributes in libraries like udev.
> 
> And why would they need it in this particular case?
> 
> > So unless userspace does not care about this at all,
> 
> Which I think is the case here, isn't it?
> 
> > you should use a 'struct device' wherever
> > possible.  The memory "savings" usually just isn't worth it unless you
> > have a _lot_ of objects being created here.
> >
> > Who is going to use all of this new information?
> 
> Somebody who wants to know how the memory in the system is laid out AFAICS.

Yes, this is for user space to make informed decisions about where it
wants to allocate/relocate hot and cold data with respect to particular
compute domains. So user space should care about these attributes,
and they won't always be static when memory hotplug support for these
attributes is added.

Does that change anything, or are you still recommending kobject? I don't
have a strong opinion either way and have both options coded and ready to
submit a new version once I know which direction is most acceptable.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-02-06 23:09         ` Keith Busch
@ 2019-02-06 23:48             ` Rafael J. Wysocki
  0 siblings, 0 replies; 53+ messages in thread
From: Rafael J. Wysocki @ 2019-02-06 23:48 UTC (permalink / raw)
  To: Keith Busch
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, Linux Kernel Mailing List,
	ACPI Devel Maling List, Linux Memory Management List,
	Dave Hansen, Dan Williams

On Thu, Feb 7, 2019 at 12:10 AM Keith Busch <keith.busch@intel.com> wrote:
>
> On Tue, Feb 05, 2019 at 04:17:09PM +0100, Rafael J. Wysocki wrote:
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > When you use a "raw" kobject then userspace tools do not see the devices
> > > and attributes in libraries like udev.
> >
> > And why would they need it in this particular case?
> >
> > > So unless userspace does not care about this at all,
> >
> > Which I think is the case here, isn't it?
> >
> > > you should use a 'struct device' wherever
> > > possible.  The memory "savings" usually just isn't worth it unless you
> > > have a _lot_ of objects being created here.
> > >
> > > Who is going to use all of this new information?
> >
> > Somebody who wants to know how the memory in the system is laid out AFAICS.
>
> Yes, this is for user space to make informed decisions about where it
> wants to allocate/relocate hot and cold data with respect to particular
> compute domains. So user space should care about these attributes,
> and they won't always be static when memory hotplug support for these
> attributes is added.
>
> Does that change anything, or are you still recommending kobject? I don't
> have a strong opinion either way and have both options coded and ready to
> submit a new version once I know which direction is most acceptable.

If you want to make dynamic changes to the sysfs directories under
this object, uevents generated by device registration and
unregistration may be useful.  However, they only trigger automatically
when you register and unregister, so presumably you'd need to do that
every time for the changes to trigger an update in user space.  Also,
whoever is interested in this data would need to listen to the uevents
to get notified.

OTOH, you can call kobject_uevent() for the "raw" kobjects too.
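
For the listening side, a user space consumer would typically read uevent
payloads from a NETLINK_KOBJECT_UEVENT socket and pick fields out of them.
The sketch below covers only the parsing step; uevent_get() is a
hypothetical helper, and the socket setup is left out:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up KEY in a kernel uevent payload, which is a sequence of
 * NUL-terminated strings: an "action@devpath" summary followed by
 * "KEY=value" pairs. Returns a pointer to the value, or NULL if the key
 * is absent or the payload is not properly terminated.
 */
static const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen, off = 0;

	assert(buf != NULL && key != NULL);
	klen = strlen(key);
	while (off < len) {
		const char *field = buf + off;
		const char *end = memchr(field, '\0', len - off);
		size_t flen;

		if (!end)
			break;	/* unterminated trailing field */
		flen = (size_t)(end - field);
		if (flen > klen && field[klen] == '=' &&
		    strncmp(field, key, klen) == 0)
			return field + klen + 1;
		off += flen + 1;
	}
	return NULL;
}
```

A listener could then react to, say, ACTION=change on a node device and
re-read the attributes under it.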

Anyway, if Greg really prefers struct device to be used here, that's
fine by me, but since the uevents in question are going to be part of
your user space I/F then, it may be good to take that into
consideration. :-)

After all, you need to know how you want the I/F to work.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 00/10] Heterogeneuos memory node attributes
  2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
@ 2019-02-07  9:53   ` Jonathan Cameron
  2019-01-24 23:07 ` [PATCHv5 02/10] acpi: Add HMAT to generic parsing tables Keith Busch
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2019-02-07  9:53 UTC (permalink / raw)
  To: Keith Busch, linux-mm
  Cc: linux-kernel, linux-acpi, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams, linuxarm

On Thu, 24 Jan 2019 16:07:14 -0700
Keith Busch <keith.busch@intel.com> wrote:

> == Changes since v4 ==
> 
>   All public interfaces have kernel docs.
> 
>   Renamed "class" to "access", docs and changed logs updated
>   accordingly. (Rafael)
> 
>   The sysfs hierarchy is altered to put initiators and targets in their
>   own attribute group directories (Rafael).
> 
>   The node lists are removed. This feedback is in conflict with v1
>   feedback, but consensus wants to remove multi-value sysfs attributes,
>   which includes lists. We only have symlinks now, just like v1 provided.
> 
>   Documentation and code patches are combined such that the code
>   introducing new attributes and its documentation are in the same
>   patch. (Rafael and Dan).
> 
>   The performance attributes, bandwidth and latency, are moved into the
>   initiators directory. This should make it obvious for which node
>   access the attributes apply, which was previously ambiguous.
>   (Jonathan Cameron).
> 
>   The HMAT code selecting "local" initiators is substantially changed.
>   Only PXM's that have identical performance to the HMAT's processor PXM
>   in Address Range Structure are registered. This is to avoid considering
>   nodes identical when only one of several perf attributes are the same.
>   (Jonathan Cameron).
> 
>   Verbose variable naming. Examples include "initiator" and "target"
>   instead of "i" and "t", "mem_pxm" and "cpu_pxm" instead of "m" and
>   "p". (Rafael)
> 
>   Compile fixes for when HMEM_REPORTING is not set. This is not a user
>   selectable config option, default 'n', and will have to be selected
>   by other config options that require it (Greg KH and Rafael).
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and are in use today that describe
> the more complex memory hierarchies that can be created.
> 
> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.

As a general heads up, ACPI 6.3 is out and makes some changes.
Discussions I've had in the past suggested there were few systems
shipping with 6.2 HMAT and that many firmwares would start at 6.3.
Of course, that might not be true, but there was fairly wide participation
in the meeting so fingers crossed it's accurate.

https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf

Particular points to note:
1. Most of the Memory Proximity Domain Attributes Structure was deprecated.
   This includes the reservation hint which has been replaced
   with a new mechanism (not used in this patch set)

2. Base units for latency changed to picoseconds.  There is a lot more
   explanatory text around how those work.

3. The measurements of latency and bandwidth no longer have an
   'aggregate performance' version.  Given the workload was not described
   this never made any sense.  Better for a knowledgeable bit of software
   to work out its own estimate.

4. There are now Generic Initiator Domains that have neither memory nor
   processors.  I'll come back with proposals on handling those soon if
   no one beats me to it. (I think it's really easy but may be wrong ;)
   I've not really thought out how this series applies to GI only domains
   yet.  Probably not useful to know you have an accelerator near to
   particular memory if you are deciding where to pin your host processor
   task ;)
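
On point 2, a tiny sketch of the unit conversion, assuming the values
exported to user space stay in nanoseconds while a 6.3 table supplies
picoseconds. The helper name is hypothetical and the rounding choice is an
assumption:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a latency in picoseconds to nanoseconds, rounding up so that a
 * small non-zero latency is never reported as 0.
 */
static uint64_t latency_ps_to_ns(uint64_t ps)
{
	assert(ps <= UINT64_MAX - 999);	/* keep the round-up from overflowing */
	return (ps + 999) / 1000;
}
```

Rounding up rather than truncating avoids turning a valid sub-nanosecond
entry into what looks like a missing value.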

Jonathan

> 
> Keith Busch (10):
>   acpi: Create subtable parsing infrastructure
>   acpi: Add HMAT to generic parsing tables
>   acpi/hmat: Parse and report heterogeneous memory
>   node: Link memory nodes to their compute nodes
>   acpi/hmat: Register processor domain to its memory
>   node: Add heterogenous memory access attributes
>   acpi/hmat: Register performance attributes
>   node: Add memory caching attributes
>   acpi/hmat: Register memory side cache attributes
>   doc/mm: New documentation for memory performance
> 
>  Documentation/ABI/stable/sysfs-devices-node   |  87 ++++-
>  Documentation/admin-guide/mm/numaperf.rst     | 167 ++++++++
>  arch/arm64/kernel/acpi_numa.c                 |   2 +-
>  arch/arm64/kernel/smp.c                       |   4 +-
>  arch/ia64/kernel/acpi.c                       |  12 +-
>  arch/x86/kernel/acpi/boot.c                   |  36 +-
>  drivers/acpi/Kconfig                          |   1 +
>  drivers/acpi/Makefile                         |   1 +
>  drivers/acpi/hmat/Kconfig                     |   9 +
>  drivers/acpi/hmat/Makefile                    |   1 +
>  drivers/acpi/hmat/hmat.c                      | 537 ++++++++++++++++++++++++++
>  drivers/acpi/numa.c                           |  16 +-
>  drivers/acpi/scan.c                           |   4 +-
>  drivers/acpi/tables.c                         |  76 +++-
>  drivers/base/Kconfig                          |   8 +
>  drivers/base/node.c                           | 354 ++++++++++++++++-
>  drivers/irqchip/irq-gic-v2m.c                 |   2 +-
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
>  drivers/irqchip/irq-gic-v3-its.c              |   6 +-
>  drivers/irqchip/irq-gic-v3.c                  |  10 +-
>  drivers/irqchip/irq-gic.c                     |   4 +-
>  drivers/mailbox/pcc.c                         |   2 +-
>  include/linux/acpi.h                          |   6 +-
>  include/linux/node.h                          |  60 ++-
>  25 files changed, 1344 insertions(+), 65 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 04/10] node: Link memory nodes to their compute nodes
  2019-01-24 23:07 ` [PATCHv5 04/10] node: Link memory nodes to their compute nodes Keith Busch
@ 2019-02-07 11:35     ` Rafael J. Wysocki
  2019-02-06 12:26     ` Jonathan Cameron
  2019-02-07 11:35     ` Rafael J. Wysocki
  2 siblings, 0 replies; 53+ messages in thread
From: Rafael J. Wysocki @ 2019-02-07 11:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Fri, Jan 25, 2019 at 12:08 AM Keith Busch <keith.busch@intel.com> wrote:
>
> Systems may be constructed with various specialized nodes. Some nodes
> may provide memory, some provide compute devices that access and use
> that memory, and others may provide both. Nodes that provide memory are
> referred to as memory targets, and nodes that can initiate memory access
> are referred to as memory initiators.
>
> Memory targets will often have varying access characteristics from
> different initiators, and platforms may have ways to express those
> relationships. In preparation for these systems, provide interfaces for
> the kernel to export the relationships between memory targets and their
> initiators on different nodes as symlinks to each other.
>
> If a system provides access locality for each initiator-target pair, nodes
> may be grouped into ranked access classes relative to other nodes. The
> new interface allows a subsystem to register relationships of varying
> classes if available and desired to be exported.
>
> A memory initiator may have multiple memory targets in the same access
> class. The target memory's initiators in a given class indicate that the
> nodes' access characteristics share the same performance relative to other
> linked initiator nodes. The targets within an initiator's access class,
> though, do not necessarily perform the same as each other.
>
> A memory target node may have multiple memory initiators. All linked
> initiators in a target's class have the same access characteristics to
> that target.
>
> The following example shows the nodes' new sysfs hierarchy for a memory
> target node 'Y' with access class 0 from initiator node 'X':
>
>   # symlinks -v /sys/devices/system/node/nodeX/access0/
>   relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
>
>   # symlinks -v /sys/devices/system/node/nodeY/access0/
>   relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
>
> The new attributes are added to the sysfs stable documentation.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>

Overall, if you decide to go for full struct device embedded in struct
node_access_nodes, feel free to add

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

to this patch.
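
For readers who want to poke at the hierarchy described in the commit
message, here is a minimal user-space sketch that builds the path of a
node's "initiators" or "targets" group for a given access class and prints
the node symlinks found there. The function names are hypothetical and not
part of the patch. It assumes the sysfs layout exactly as quoted above and
skips any entry whose name does not start with "node".

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format ".../nodeN/accessC/<group>" into buf. */
static int node_access_path(char *buf, size_t len, unsigned int node,
			    unsigned int access, const char *group)
{
	return snprintf(buf, len,
			"/sys/devices/system/node/node%u/access%u/%s",
			node, access, group);
}

/*
 * Hypothetical helper: print every nodeM link in the given group.
 * Returns the number of links found, or -1 if the path does not fit
 * or the directory cannot be opened (e.g. no such access class).
 */
static int list_linked_nodes(unsigned int node, unsigned int access,
			     const char *group)
{
	char path[128];
	struct dirent *de;
	DIR *dir;
	int count = 0;
	int n;

	n = node_access_path(path, sizeof(path), node, access, group);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		/* Skip ".", ".." and anything that is not a node link. */
		if (strncmp(de->d_name, "node", 4) != 0)
			continue;
		printf("%s/%s\n", path, de->d_name);
		count++;
	}

	closedir(dir);
	return count;
}
```

Calling list_linked_nodes(0, 0, "targets") would list the memory targets
linked under access class 0 of node 0, if the kernel exports that directory.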

> ---
>  Documentation/ABI/stable/sysfs-devices-node |  25 ++++-
>  drivers/base/node.c                         | 142 +++++++++++++++++++++++++++-
>  include/linux/node.h                        |   7 +-
>  3 files changed, 171 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 3e90e1f3bf0a..fb843222a281 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -90,4 +90,27 @@ Date:                December 2009
>  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
>  Description:
>                 The node's huge page size control/query attributes.
> -               See Documentation/admin-guide/mm/hugetlbpage.rst
> \ No newline at end of file
> +               See Documentation/admin-guide/mm/hugetlbpage.rst
> +
> +What:          /sys/devices/system/node/nodeX/accessY/
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The node's relationship to other nodes for access class "Y".
> +
> +What:          /sys/devices/system/node/nodeX/accessY/initiators/
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The directory containing symlinks to memory initiator
> +               nodes that have class "Y" access to this target node's
> +               memory. CPUs and other memory initiators in nodes not in
> +               the list accessing this node's memory may have different
> +               performance.
> +
> +What:          /sys/devices/system/node/nodeX/accessY/targets/
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The directory containing symlinks to memory targets that
> +               this initiator node has class "Y" access to.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 86d6cd92ce3d..6f4097680580 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -17,6 +17,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/cpu.h>
>  #include <linux/device.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/swap.h>
>  #include <linux/slab.h>
>
> @@ -59,6 +60,94 @@ static inline ssize_t node_read_cpulist(struct device *dev,
>  static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>
> +/**
> + * struct node_access_nodes - Access class device to hold user visible
> + *                           relationships to other nodes.
> + * @dev:       Device for this memory access class
> + * @list_node: List element in the node's access list
> + * @access:    The access class rank
> + */
> +struct node_access_nodes {
> +       struct device           dev;
> +       struct list_head        list_node;
> +       unsigned                access;
> +};
> +#define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev)
> +
> +static struct attribute *node_init_access_node_attrs[] = {
> +       NULL,
> +};
> +
> +static struct attribute *node_targ_access_node_attrs[] = {
> +       NULL,
> +};
> +
> +static const struct attribute_group initiators = {
> +       .name   = "initiators",
> +       .attrs  = node_init_access_node_attrs,
> +};
> +
> +static const struct attribute_group targets = {
> +       .name   = "targets",
> +       .attrs  = node_targ_access_node_attrs,
> +};
> +
> +static const struct attribute_group *node_access_node_groups[] = {
> +       &initiators,
> +       &targets,
> +       NULL,
> +};
> +
> +static void node_remove_accesses(struct node *node)
> +{
> +       struct node_access_nodes *c, *cnext;
> +
> +       list_for_each_entry_safe(c, cnext, &node->access_list, list_node) {
> +               list_del(&c->list_node);
> +               device_unregister(&c->dev);
> +       }
> +}
> +
> +static void node_access_release(struct device *dev)
> +{
> +       kfree(to_access_nodes(dev));
> +}
> +
> +static struct node_access_nodes *node_init_node_access(struct node *node,
> +                                                      unsigned access)
> +{
> +       struct node_access_nodes *access_node;
> +       struct device *dev;
> +
> +       list_for_each_entry(access_node, &node->access_list, list_node)
> +               if (access_node->access == access)
> +                       return access_node;
> +
> +       access_node = kzalloc(sizeof(*access_node), GFP_KERNEL);
> +       if (!access_node)
> +               return NULL;
> +
> +       access_node->access = access;
> +       dev = &access_node->dev;
> +       dev->parent = &node->dev;
> +       dev->release = node_access_release;
> +       dev->groups = node_access_node_groups;
> +       if (dev_set_name(dev, "access%u", access))
> +               goto free;
> +
> +       if (device_register(dev))
> +               goto free_name;
> +
> +       pm_runtime_no_callbacks(dev);
> +       list_add_tail(&access_node->list_node, &node->access_list);
> +       return access_node;
> +free_name:
> +       kfree_const(dev->kobj.name);
> +free:
> +       kfree(access_node);
> +       return NULL;
> +}
> +
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
>  static ssize_t node_read_meminfo(struct device *dev,
>                         struct device_attribute *attr, char *buf)
> @@ -340,7 +429,7 @@ static int register_node(struct node *node, int num)
>  void unregister_node(struct node *node)
>  {
>         hugetlb_unregister_node(node);          /* no-op, if memoryless node */
> -
> +       node_remove_accesses(node);
>         device_unregister(&node->dev);
>  }
>
> @@ -372,6 +461,56 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid)
>                                  kobject_name(&node_devices[nid]->dev.kobj));
>  }
>
> +/**
> + * register_memory_node_under_compute_node - link memory node to its compute
> + *                                          node for a given access class.
> + * @mem_nid:   Memory node number
> + * @cpu_nid:   CPU node number
> + * @access:    Access class to register
> + *
> + * Description:
> + *     For use with platforms that may have separate memory and compute nodes.
> + *     This function will export node relationships linking which memory
> + *     initiator nodes can access memory targets at a given ranked access
> + *     class.
> + */
> +int register_memory_node_under_compute_node(unsigned int mem_nid,
> +                                           unsigned int cpu_nid,
> +                                           unsigned access)
> +{
> +       struct node *init_node, *targ_node;
> +       struct node_access_nodes *initiator, *target;
> +       int ret;
> +
> +       if (!node_online(cpu_nid) || !node_online(mem_nid))
> +               return -ENODEV;
> +
> +       init_node = node_devices[cpu_nid];
> +       targ_node = node_devices[mem_nid];
> +       initiator = node_init_node_access(init_node, access);
> +       target = node_init_node_access(targ_node, access);
> +       if (!initiator || !target)
> +               return -ENOMEM;
> +
> +       ret = sysfs_add_link_to_group(&initiator->dev.kobj, "targets",
> +                                     &targ_node->dev.kobj,
> +                                     dev_name(&targ_node->dev));
> +       if (ret)
> +               return ret;
> +
> +       ret = sysfs_add_link_to_group(&target->dev.kobj, "initiators",
> +                                     &init_node->dev.kobj,
> +                                     dev_name(&init_node->dev));
> +       if (ret)
> +               goto err;
> +
> +       return 0;
> + err:
> +       sysfs_remove_link_from_group(&initiator->dev.kobj, "targets",
> +                                    dev_name(&targ_node->dev));
> +       return ret;
> +}
> +
>  int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
>  {
>         struct device *obj;
> @@ -580,6 +719,7 @@ int __register_one_node(int nid)
>                         register_cpu_under_node(cpu, nid);
>         }
>
> +       INIT_LIST_HEAD(&node_devices[nid]->access_list);
>         /* initialize work queue for memory hot plug */
>         init_node_hugetlb_work(nid);
>
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 257bb3d6d014..f34688a203c1 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -17,11 +17,12 @@
>
>  #include <linux/device.h>
>  #include <linux/cpumask.h>
> +#include <linux/list.h>
>  #include <linux/workqueue.h>
>
>  struct node {
>         struct device   dev;
> -
> +       struct list_head access_list;
>  #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
>         struct work_struct      node_work;
>  #endif
> @@ -75,6 +76,10 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>  extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>                                            unsigned long phys_index);
>
> +extern int register_memory_node_under_compute_node(unsigned int mem_nid,
> +                                                  unsigned int cpu_nid,
> +                                                  unsigned access);
> +
>  #ifdef CONFIG_HUGETLBFS
>  extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
>                                          node_registration_func_t unregister);
> --
> 2.14.4
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCHv5 00/10] Heterogeneuos memory node attributes
  2019-02-07  9:53   ` Jonathan Cameron
  (?)
@ 2019-02-07 15:08   ` Keith Busch
  -1 siblings, 0 replies; 53+ messages in thread
From: Keith Busch @ 2019-02-07 15:08 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-mm, linux-kernel, linux-acpi, Greg Kroah-Hartman,
	Rafael Wysocki, Hansen, Dave, Williams, Dan J, linuxarm

On Thu, Feb 07, 2019 at 01:53:36AM -0800, Jonathan Cameron wrote:
> As a general heads up, ACPI 6.3 is out and makes some changes.
> Discussions I've had in the past suggested there were few systems
> shipping with 6.2 HMAT and that many firmwares would start at 6.3.
> Of course, that might not be true, but there was fairly wide participation
> in the meeting so fingers crossed it's accurate.
> 
> https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
> 
> Particular points to note:
> 1. Most of the Memory Proximity Domain Attributes Structure was deprecated.
>    This includes the reservation hint which has been replaced
>    with a new mechanism (not used in this patch set)

Yes, and duplicating all the address ranges with SRAT never made any
sense. No need to define the same thing in multiple places; that's just
another opportunity to get it wrong.
 
> 2. Base units for latency changed to picoseconds.  There is a lot more
>    explanatory text around how those work.
>
> 3. The measurements of latency and bandwidth no longer have an
>    'aggregate performance' version.  Given the workload was not described
>    this never made any sense.  Better for a knowledgeable bit of software
>    to work out its own estimate.

Nice. Though they shifted 1st level cached to occupy the same value that
the aggregate used. They could have just deprecated the old value so we
could maintain compatibility, but that's okay!
 
> 4. There are now Generic Initiator Domains that have neither memory nor
>    processors.  I'll come back with proposals on handling those soon if
>    no one beats me to it. (I think it's really easy but may be wrong ;)
>    I've not really thought out how this series applies to GI only domains
>    yet.  Probably not useful to know you have an accelerator near to
>    particular memory if you are deciding where to pin your host processor
>    task ;)

I haven't any particular use for these at the moment either, though it
shouldn't change what this is going to export.

Thanks for the heads up! I'll incorporate 6.3 into v6.

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2019-02-07 15:08 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-24 23:07 [PATCHv5 00/10] Heterogeneuos memory node attributes Keith Busch
2019-01-24 23:07 ` [PATCHv5 01/10] acpi: Create subtable parsing infrastructure Keith Busch
2019-01-24 23:07 ` [PATCHv5 02/10] acpi: Add HMAT to generic parsing tables Keith Busch
2019-01-24 23:07 ` [PATCHv5 03/10] acpi/hmat: Parse and report heterogeneous memory Keith Busch
2019-02-05 12:12   ` Rafael J. Wysocki
2019-02-05 12:12     ` Rafael J. Wysocki
2019-02-06 12:28   ` Jonathan Cameron
2019-02-06 12:28     ` Jonathan Cameron
2019-02-06 16:06     ` Keith Busch
2019-02-06 16:39       ` Jonathan Cameron
2019-02-06 16:39         ` Jonathan Cameron
2019-01-24 23:07 ` [PATCHv5 04/10] node: Link memory nodes to their compute nodes Keith Busch
2019-02-05 12:33   ` Rafael J. Wysocki
2019-02-05 12:33     ` Rafael J. Wysocki
2019-02-05 14:48     ` Keith Busch
2019-02-05 14:52     ` Greg Kroah-Hartman
2019-02-05 15:17       ` Rafael J. Wysocki
2019-02-05 15:17         ` Rafael J. Wysocki
2019-02-06 23:09         ` Keith Busch
2019-02-06 23:48           ` Rafael J. Wysocki
2019-02-06 23:48             ` Rafael J. Wysocki
2019-02-06 12:26   ` Jonathan Cameron
2019-02-06 12:26     ` Jonathan Cameron
2019-02-06 16:12     ` Keith Busch
2019-02-06 16:47       ` Jonathan Cameron
2019-02-07 11:35   ` Rafael J. Wysocki
2019-02-07 11:35     ` Rafael J. Wysocki
2019-01-24 23:07 ` [PATCHv5 05/10] acpi/hmat: Register processor domain to its memory Keith Busch
2019-02-06 12:26   ` Jonathan Cameron
2019-02-06 12:26     ` Jonathan Cameron
2019-01-24 23:07 ` [PATCHv5 06/10] node: Add heterogenous memory access attributes Keith Busch
2019-01-24 23:07 ` [PATCHv5 07/10] acpi/hmat: Register performance attributes Keith Busch
2019-02-06 12:24   ` Jonathan Cameron
2019-02-06 12:24     ` Jonathan Cameron
2019-01-24 23:07 ` [PATCHv5 08/10] node: Add memory caching attributes Keith Busch
2019-02-06 12:24   ` Jonathan Cameron
2019-02-06 12:24     ` Jonathan Cameron
2019-01-24 23:07 ` [PATCHv5 09/10] acpi/hmat: Register memory side cache attributes Keith Busch
2019-02-06 12:17   ` Jonathan Cameron
2019-02-06 12:17     ` Jonathan Cameron
2019-01-24 23:07 ` [PATCHv5 10/10] doc/mm: New documentation for memory performance Keith Busch
2019-02-06 10:45   ` Jonathan Cameron
2019-02-06 10:45     ` Jonathan Cameron
2019-02-06 16:25     ` Keith Busch
2019-01-28 14:00 ` [PATCHv5 00/10] Heterogeneuos memory node attributes Michal Hocko
2019-02-06 12:31 ` Jonathan Cameron
2019-02-06 12:31   ` Jonathan Cameron
2019-02-06 17:19   ` Keith Busch
2019-02-06 17:30     ` Jonathan Cameron
2019-02-06 17:30       ` Jonathan Cameron
2019-02-07  9:53 ` Jonathan Cameron
2019-02-07  9:53   ` Jonathan Cameron
2019-02-07 15:08   ` Keith Busch
