* [PATCHv4 00/13] Heterogeneous memory node attributes
@ 2019-01-16 17:57 Keith Busch
  2019-01-16 17:57 ` [PATCHv4 01/13] acpi: Create subtable parsing infrastructure Keith Busch
                   ` (14 more replies)
  0 siblings, 15 replies; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

The series seems quite calm now. I've received some approvals on the
proposal, and heard no objections to the new core interfaces.

Please let me know if there is anyone, or any group of people, from whom
I should request and await a review. And if anyone reading this would
like additional time before I post a subsequent version, please let
me know.

I also wanted to inquire about upstream strategy if/when all desired
reviews are received. The series spans a few subsystems, so I'm not
sure whose tree is the best candidate. I could see an argument for
driver-core, acpi, or mm as possible paths. Please let me know if there's
a more appropriate option or any other gating concerns.

== Changes from v3 ==

  I've fixed the documentation issues that were raised for v3.

  Moved the hmat files according to Rafael's recommendation.

  Added received Reviewed-by's.

Otherwise this v4 is much the same as v3.

== Background ==

Platforms may provide multiple types of CPU-attached system memory. The
memory ranges for each type may have different characteristics that
applications may wish to know about when considering what node they want
their memory allocated from.

It had previously been difficult to describe these setups as memory
ranges were generally lumped into the NUMA node of the CPUs. New
platform attributes have been created, and are in use today, that
describe the more complex memory hierarchies that can be created.

This series' objective is to provide the attributes from such systems
that are useful for applications to know about, and readily usable with
existing tools and libraries.
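
As a rough sketch of the intended consumption model (not part of this
series), an application could read the exported attributes directly from
sysfs. A minimal userspace example, assuming a node0 that reports class0
access attributes; the paths and file names follow the ABI proposed in
patches 5 and 8, and latency values are in nanoseconds per patch 7's
description:

  #include <stdio.h>

  int main(void)
  {
          char buf[64];
          FILE *f;

          /* read_latency per this series' proposed sysfs ABI */
          f = fopen("/sys/devices/system/node/node0/class0/read_latency", "r");
          if (!f)
                  return 1;
          if (fgets(buf, sizeof(buf), f))
                  printf("node0 class0 read latency (nsec): %s", buf);
          fclose(f);
          return 0;
  }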

Keith Busch (13):
  acpi: Create subtable parsing infrastructure
  acpi: Add HMAT to generic parsing tables
  acpi/hmat: Parse and report heterogeneous memory
  node: Link memory nodes to their compute nodes
  Documentation/ABI: Add new node sysfs attributes
  acpi/hmat: Register processor domain to its memory
  node: Add heterogeneous memory access attributes
  Documentation/ABI: Add node performance attributes
  acpi/hmat: Register performance attributes
  node: Add memory caching attributes
  Documentation/ABI: Add node cache attributes
  acpi/hmat: Register memory side cache attributes
  doc/mm: New documentation for memory performance

 Documentation/ABI/stable/sysfs-devices-node   |  87 +++++-
 Documentation/admin-guide/mm/numaperf.rst     | 184 +++++++++++++
 arch/arm64/kernel/acpi_numa.c                 |   2 +-
 arch/arm64/kernel/smp.c                       |   4 +-
 arch/ia64/kernel/acpi.c                       |  12 +-
 arch/x86/kernel/acpi/boot.c                   |  36 +--
 drivers/acpi/Kconfig                          |   1 +
 drivers/acpi/Makefile                         |   1 +
 drivers/acpi/hmat/Kconfig                     |   9 +
 drivers/acpi/hmat/Makefile                    |   1 +
 drivers/acpi/hmat/hmat.c                      | 375 ++++++++++++++++++++++++++
 drivers/acpi/numa.c                           |  16 +-
 drivers/acpi/scan.c                           |   4 +-
 drivers/acpi/tables.c                         |  76 +++++-
 drivers/base/Kconfig                          |   8 +
 drivers/base/node.c                           | 317 +++++++++++++++++++++-
 drivers/irqchip/irq-gic-v2m.c                 |   2 +-
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
 drivers/irqchip/irq-gic-v3-its.c              |   6 +-
 drivers/irqchip/irq-gic-v3.c                  |  10 +-
 drivers/irqchip/irq-gic.c                     |   4 +-
 drivers/mailbox/pcc.c                         |   2 +-
 include/linux/acpi.h                          |   6 +-
 include/linux/node.h                          |  70 ++++-
 25 files changed, 1172 insertions(+), 65 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/numaperf.rst
 create mode 100644 drivers/acpi/hmat/Kconfig
 create mode 100644 drivers/acpi/hmat/Makefile
 create mode 100644 drivers/acpi/hmat/hmat.c

-- 
2.14.4



* [PATCHv4 01/13] acpi: Create subtable parsing infrastructure
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneous memory node attributes Keith Busch
@ 2019-01-16 17:57 ` Keith Busch
  2019-01-16 17:57 ` [PATCHv4 02/13] acpi: Add HMAT to generic parsing tables Keith Busch
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Parsing entries in an ACPI table had assumed a generic header
structure. There is no standard ACPI header, though, so less common
layouts with different field sizes required custom parsers to go through
their subtable entry list.

Create the infrastructure for adding different table types so that
parsing the entries array may be reused for all ACPI system tables and
the common code doesn't need to be duplicated.
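
A minimal sketch of a handler written against the new callback type (the
handler name here is hypothetical; the converted handlers below all
follow this pattern):

  static int __init example_parse_madt_entry(union acpi_subtable_headers *header,
                                             const unsigned long end)
  {
          /* reach the generic fields through the union's common member */
          acpi_table_print_madt_entry(&header->common);
          return 0;
  }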

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 arch/arm64/kernel/acpi_numa.c                 |  2 +-
 arch/arm64/kernel/smp.c                       |  4 +-
 arch/ia64/kernel/acpi.c                       | 12 ++---
 arch/x86/kernel/acpi/boot.c                   | 36 +++++++-------
 drivers/acpi/numa.c                           | 16 +++----
 drivers/acpi/scan.c                           |  4 +-
 drivers/acpi/tables.c                         | 67 +++++++++++++++++++++++----
 drivers/irqchip/irq-gic-v2m.c                 |  2 +-
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |  2 +-
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |  2 +-
 drivers/irqchip/irq-gic-v3-its.c              |  6 +--
 drivers/irqchip/irq-gic-v3.c                  | 10 ++--
 drivers/irqchip/irq-gic.c                     |  4 +-
 drivers/mailbox/pcc.c                         |  2 +-
 include/linux/acpi.h                          |  5 +-
 15 files changed, 112 insertions(+), 62 deletions(-)

diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index eac1d0cc595c..7ff800045434 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -45,7 +45,7 @@ static inline int get_cpu_for_acpi_id(u32 uid)
 	return -EINVAL;
 }
 
-static int __init acpi_parse_gicc_pxm(struct acpi_subtable_header *header,
+static int __init acpi_parse_gicc_pxm(union acpi_subtable_headers *header,
 				      const unsigned long end)
 {
 	struct acpi_srat_gicc_affinity *pa;
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 1598d6f7200a..e6a148604dcc 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -553,7 +553,7 @@ acpi_map_gic_cpu_interface(struct acpi_madt_generic_interrupt *processor)
 }
 
 static int __init
-acpi_parse_gic_cpu_interface(struct acpi_subtable_header *header,
+acpi_parse_gic_cpu_interface(union acpi_subtable_headers *header,
 			     const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *processor;
@@ -562,7 +562,7 @@ acpi_parse_gic_cpu_interface(struct acpi_subtable_header *header,
 	if (BAD_MADT_GICC_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	acpi_map_gic_cpu_interface(processor);
 
diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 41eb281709da..3973d2c2a9b0 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -177,7 +177,7 @@ struct acpi_table_madt *acpi_madt __initdata;
 static u8 has_8259;
 
 static int __init
-acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
+acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header,
 			  const unsigned long end)
 {
 	struct acpi_madt_local_apic_override *lapic;
@@ -216,7 +216,7 @@ acpi_parse_lsapic(struct acpi_subtable_header * header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_local_apic_nmi *lacpi_nmi;
 
@@ -230,7 +230,7 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e
 }
 
 static int __init
-acpi_parse_iosapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_iosapic(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_io_sapic *iosapic;
 
@@ -245,7 +245,7 @@ acpi_parse_iosapic(struct acpi_subtable_header * header, const unsigned long end
 static unsigned int __initdata acpi_madt_rev;
 
 static int __init
-acpi_parse_plat_int_src(struct acpi_subtable_header * header,
+acpi_parse_plat_int_src(union acpi_subtable_headers * header,
 			const unsigned long end)
 {
 	struct acpi_madt_interrupt_source *plintsrc;
@@ -329,7 +329,7 @@ unsigned int get_cpei_target_cpu(void)
 }
 
 static int __init
-acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
+acpi_parse_int_src_ovr(union acpi_subtable_headers * header,
 		       const unsigned long end)
 {
 	struct acpi_madt_interrupt_override *p;
@@ -350,7 +350,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
 }
 
 static int __init
-acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_nmi_source *nmi_src;
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 2624de16cd7a..b694a32f95d4 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -197,7 +197,7 @@ static int acpi_register_lapic(int id, u32 acpiid, u8 enabled)
 }
 
 static int __init
-acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
 {
 	struct acpi_madt_local_x2apic *processor = NULL;
 #ifdef CONFIG_X86_X2APIC
@@ -210,7 +210,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 #ifdef CONFIG_X86_X2APIC
 	apic_id = processor->local_apic_id;
@@ -242,7 +242,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_local_apic *processor = NULL;
 
@@ -251,7 +251,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	/* Ignore invalid ID */
 	if (processor->id == 0xff)
@@ -272,7 +272,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_sapic(union acpi_subtable_headers *header, const unsigned long end)
 {
 	struct acpi_madt_local_sapic *processor = NULL;
 
@@ -281,7 +281,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	acpi_register_lapic((processor->id << 8) | processor->eid,/* APIC ID */
 			    processor->processor_id, /* ACPI ID */
@@ -291,7 +291,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
 }
 
 static int __init
-acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
+acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header,
 			  const unsigned long end)
 {
 	struct acpi_madt_local_apic_override *lapic_addr_ovr = NULL;
@@ -301,7 +301,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
 	if (BAD_MADT_ENTRY(lapic_addr_ovr, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	acpi_lapic_addr = lapic_addr_ovr->address;
 
@@ -309,7 +309,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
 }
 
 static int __init
-acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
+acpi_parse_x2apic_nmi(union acpi_subtable_headers *header,
 		      const unsigned long end)
 {
 	struct acpi_madt_local_x2apic_nmi *x2apic_nmi = NULL;
@@ -319,7 +319,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
 	if (BAD_MADT_ENTRY(x2apic_nmi, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	if (x2apic_nmi->lint != 1)
 		printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
@@ -328,7 +328,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
 }
 
 static int __init
-acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_local_apic_nmi *lapic_nmi = NULL;
 
@@ -337,7 +337,7 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e
 	if (BAD_MADT_ENTRY(lapic_nmi, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	if (lapic_nmi->lint != 1)
 		printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
@@ -449,7 +449,7 @@ static int __init mp_register_ioapic_irq(u8 bus_irq, u8 polarity,
 }
 
 static int __init
-acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_ioapic(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_io_apic *ioapic = NULL;
 	struct ioapic_domain_cfg cfg = {
@@ -462,7 +462,7 @@ acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end)
 	if (BAD_MADT_ENTRY(ioapic, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	/* Statically assign IRQ numbers for IOAPICs hosting legacy IRQs */
 	if (ioapic->global_irq_base < nr_legacy_irqs())
@@ -508,7 +508,7 @@ static void __init acpi_sci_ioapic_setup(u8 bus_irq, u16 polarity, u16 trigger,
 }
 
 static int __init
-acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
+acpi_parse_int_src_ovr(union acpi_subtable_headers * header,
 		       const unsigned long end)
 {
 	struct acpi_madt_interrupt_override *intsrc = NULL;
@@ -518,7 +518,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
 	if (BAD_MADT_ENTRY(intsrc, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	if (intsrc->source_irq == acpi_gbl_FADT.sci_interrupt) {
 		acpi_sci_ioapic_setup(intsrc->source_irq,
@@ -550,7 +550,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
 }
 
 static int __init
-acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end)
 {
 	struct acpi_madt_nmi_source *nmi_src = NULL;
 
@@ -559,7 +559,7 @@ acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end
 	if (BAD_MADT_ENTRY(nmi_src, end))
 		return -EINVAL;
 
-	acpi_table_print_madt_entry(header);
+	acpi_table_print_madt_entry(&header->common);
 
 	/* TBD: Support nimsrc entries? */
 
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 7bbbf8256a41..d6433b14864c 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -338,7 +338,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
 }
 
 static int __init
-acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
+acpi_parse_x2apic_affinity(union acpi_subtable_headers *header,
 			   const unsigned long end)
 {
 	struct acpi_srat_x2apic_cpu_affinity *processor_affinity;
@@ -347,7 +347,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
 	if (!processor_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	acpi_numa_x2apic_affinity_init(processor_affinity);
@@ -356,7 +356,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
 }
 
 static int __init
-acpi_parse_processor_affinity(struct acpi_subtable_header *header,
+acpi_parse_processor_affinity(union acpi_subtable_headers *header,
 			      const unsigned long end)
 {
 	struct acpi_srat_cpu_affinity *processor_affinity;
@@ -365,7 +365,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header,
 	if (!processor_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	acpi_numa_processor_affinity_init(processor_affinity);
@@ -374,7 +374,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header,
 }
 
 static int __init
-acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
+acpi_parse_gicc_affinity(union acpi_subtable_headers *header,
 			 const unsigned long end)
 {
 	struct acpi_srat_gicc_affinity *processor_affinity;
@@ -383,7 +383,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
 	if (!processor_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	acpi_numa_gicc_affinity_init(processor_affinity);
@@ -394,7 +394,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
 static int __initdata parsed_numa_memblks;
 
 static int __init
-acpi_parse_memory_affinity(struct acpi_subtable_header * header,
+acpi_parse_memory_affinity(union acpi_subtable_headers * header,
 			   const unsigned long end)
 {
 	struct acpi_srat_mem_affinity *memory_affinity;
@@ -403,7 +403,7 @@ acpi_parse_memory_affinity(struct acpi_subtable_header * header,
 	if (!memory_affinity)
 		return -EINVAL;
 
-	acpi_table_print_srat_entry(header);
+	acpi_table_print_srat_entry(&header->common);
 
 	/* let architecture-dependent part to do it */
 	if (!acpi_numa_memory_affinity_init(memory_affinity))
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 5efd4219f112..a894ce3556f2 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -2240,10 +2240,10 @@ static struct acpi_probe_entry *ape;
 static int acpi_probe_count;
 static DEFINE_MUTEX(acpi_probe_mutex);
 
-static int __init acpi_match_madt(struct acpi_subtable_header *header,
+static int __init acpi_match_madt(union acpi_subtable_headers *header,
 				  const unsigned long end)
 {
-	if (!ape->subtable_valid || ape->subtable_valid(header, ape))
+	if (!ape->subtable_valid || ape->subtable_valid(&header->common, ape))
 		if (!ape->probe_subtbl(header, end))
 			acpi_probe_count++;
 
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 48eabb6c2d4f..967e1168becf 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -49,6 +49,15 @@ static struct acpi_table_desc initial_tables[ACPI_MAX_TABLES] __initdata;
 
 static int acpi_apic_instance __initdata;
 
+enum acpi_subtable_type {
+	ACPI_SUBTABLE_COMMON,
+};
+
+struct acpi_subtable_entry {
+	union acpi_subtable_headers *hdr;
+	enum acpi_subtable_type type;
+};
+
 /*
  * Disable table checksum verification for the early stage due to the size
  * limitation of the current x86 early mapping implementation.
@@ -217,6 +226,42 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
 	}
 }
 
+static unsigned long __init
+acpi_get_entry_type(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return entry->hdr->common.type;
+	}
+	return 0;
+}
+
+static unsigned long __init
+acpi_get_entry_length(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return entry->hdr->common.length;
+	}
+	return 0;
+}
+
+static unsigned long __init
+acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return sizeof(entry->hdr->common);
+	}
+	return 0;
+}
+
+static enum acpi_subtable_type __init
+acpi_get_subtable_type(char *id)
+{
+	return ACPI_SUBTABLE_COMMON;
+}
+
 /**
  * acpi_parse_entries_array - for each proc_num find a suitable subtable
  *
@@ -246,8 +291,8 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 		struct acpi_subtable_proc *proc, int proc_num,
 		unsigned int max_entries)
 {
-	struct acpi_subtable_header *entry;
-	unsigned long table_end;
+	struct acpi_subtable_entry entry;
+	unsigned long table_end, subtable_len, entry_len;
 	int count = 0;
 	int errs = 0;
 	int i;
@@ -270,19 +315,20 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 
 	/* Parse all entries looking for a match. */
 
-	entry = (struct acpi_subtable_header *)
+	entry.type = acpi_get_subtable_type(id);
+	entry.hdr = (union acpi_subtable_headers *)
 	    ((unsigned long)table_header + table_size);
+	subtable_len = acpi_get_subtable_header_length(&entry);
 
-	while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
-	       table_end) {
+	while (((unsigned long)entry.hdr) + subtable_len  < table_end) {
 		if (max_entries && count >= max_entries)
 			break;
 
 		for (i = 0; i < proc_num; i++) {
-			if (entry->type != proc[i].id)
+			if (acpi_get_entry_type(&entry) != proc[i].id)
 				continue;
 			if (!proc[i].handler ||
-			     (!errs && proc[i].handler(entry, table_end))) {
+			     (!errs && proc[i].handler(entry.hdr, table_end))) {
 				errs++;
 				continue;
 			}
@@ -297,13 +343,14 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 		 * If entry->length is 0, break from this loop to avoid
 		 * infinite loop.
 		 */
-		if (entry->length == 0) {
+		entry_len = acpi_get_entry_length(&entry);
+		if (entry_len == 0) {
 			pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id);
 			return -EINVAL;
 		}
 
-		entry = (struct acpi_subtable_header *)
-		    ((unsigned long)entry + entry->length);
+		entry.hdr = (union acpi_subtable_headers *)
+		    ((unsigned long)entry.hdr + entry_len);
 	}
 
 	if (max_entries && count > max_entries) {
diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
index f5fe0100f9ff..de14e06fd9ec 100644
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -446,7 +446,7 @@ static struct fwnode_handle *gicv2m_get_fwnode(struct device *dev)
 }
 
 static int __init
-acpi_parse_madt_msi(struct acpi_subtable_header *header,
+acpi_parse_madt_msi(union acpi_subtable_headers *header,
 		    const unsigned long end)
 {
 	int ret;
diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
index 8d6d009d1d58..c81d5b81da56 100644
--- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
@@ -159,7 +159,7 @@ static int __init its_pci_of_msi_init(void)
 #ifdef CONFIG_ACPI
 
 static int __init
-its_pci_msi_parse_madt(struct acpi_subtable_header *header,
+its_pci_msi_parse_madt(union acpi_subtable_headers *header,
 		       const unsigned long end)
 {
 	struct acpi_madt_generic_translator *its_entry;
diff --git a/drivers/irqchip/irq-gic-v3-its-platform-msi.c b/drivers/irqchip/irq-gic-v3-its-platform-msi.c
index 7b8e87b493fe..9cdcda5bb3bd 100644
--- a/drivers/irqchip/irq-gic-v3-its-platform-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-platform-msi.c
@@ -117,7 +117,7 @@ static int __init its_pmsi_init_one(struct fwnode_handle *fwnode,
 
 #ifdef CONFIG_ACPI
 static int __init
-its_pmsi_parse_madt(struct acpi_subtable_header *header,
+its_pmsi_parse_madt(union acpi_subtable_headers *header,
 			const unsigned long end)
 {
 	struct acpi_madt_generic_translator *its_entry;
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index db20e992a40f..d6677075d68f 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -3764,13 +3764,13 @@ static int __init acpi_get_its_numa_node(u32 its_id)
 	return NUMA_NO_NODE;
 }
 
-static int __init gic_acpi_match_srat_its(struct acpi_subtable_header *header,
+static int __init gic_acpi_match_srat_its(union acpi_subtable_headers *header,
 					  const unsigned long end)
 {
 	return 0;
 }
 
-static int __init gic_acpi_parse_srat_its(struct acpi_subtable_header *header,
+static int __init gic_acpi_parse_srat_its(union acpi_subtable_headers *header,
 			 const unsigned long end)
 {
 	int node;
@@ -3837,7 +3837,7 @@ static int __init acpi_get_its_numa_node(u32 its_id) { return NUMA_NO_NODE; }
 static void __init acpi_its_srat_maps_free(void) { }
 #endif
 
-static int __init gic_acpi_parse_madt_its(struct acpi_subtable_header *header,
+static int __init gic_acpi_parse_madt_its(union acpi_subtable_headers *header,
 					  const unsigned long end)
 {
 	struct acpi_madt_generic_translator *its_entry;
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 0868a9d81c3c..44db6b809c52 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1392,7 +1392,7 @@ gic_acpi_register_redist(phys_addr_t phys_base, void __iomem *redist_base)
 }
 
 static int __init
-gic_acpi_parse_madt_redist(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_redist(union acpi_subtable_headers *header,
 			   const unsigned long end)
 {
 	struct acpi_madt_generic_redistributor *redist =
@@ -1410,7 +1410,7 @@ gic_acpi_parse_madt_redist(struct acpi_subtable_header *header,
 }
 
 static int __init
-gic_acpi_parse_madt_gicc(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_gicc(union acpi_subtable_headers *header,
 			 const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *gicc =
@@ -1452,14 +1452,14 @@ static int __init gic_acpi_collect_gicr_base(void)
 	return -ENODEV;
 }
 
-static int __init gic_acpi_match_gicr(struct acpi_subtable_header *header,
+static int __init gic_acpi_match_gicr(union acpi_subtable_headers *header,
 				  const unsigned long end)
 {
 	/* Subtable presence means that redist exists, that's it */
 	return 0;
 }
 
-static int __init gic_acpi_match_gicc(struct acpi_subtable_header *header,
+static int __init gic_acpi_match_gicc(union acpi_subtable_headers *header,
 				      const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *gicc =
@@ -1525,7 +1525,7 @@ static bool __init acpi_validate_gic_table(struct acpi_subtable_header *header,
 	return true;
 }
 
-static int __init gic_acpi_parse_virt_madt_gicc(struct acpi_subtable_header *header,
+static int __init gic_acpi_parse_virt_madt_gicc(union acpi_subtable_headers *header,
 						const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *gicc =
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index ba2a37a27a54..a749d73f8337 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -1508,7 +1508,7 @@ static struct
 } acpi_data __initdata;
 
 static int __init
-gic_acpi_parse_madt_cpu(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_cpu(union acpi_subtable_headers *header,
 			const unsigned long end)
 {
 	struct acpi_madt_generic_interrupt *processor;
@@ -1540,7 +1540,7 @@ gic_acpi_parse_madt_cpu(struct acpi_subtable_header *header,
 }
 
 /* The things you have to do to just *count* something... */
-static int __init acpi_dummy_func(struct acpi_subtable_header *header,
+static int __init acpi_dummy_func(union acpi_subtable_headers *header,
 				  const unsigned long end)
 {
 	return 0;
diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
index 256f18b67e8a..08a0a3517138 100644
--- a/drivers/mailbox/pcc.c
+++ b/drivers/mailbox/pcc.c
@@ -382,7 +382,7 @@ static const struct mbox_chan_ops pcc_chan_ops = {
  *
  * This gets called for each entry in the PCC table.
  */
-static int parse_pcc_subspace(struct acpi_subtable_header *header,
+static int parse_pcc_subspace(union acpi_subtable_headers *header,
 		const unsigned long end)
 {
 	struct acpi_pcct_subspace *ss = (struct acpi_pcct_subspace *) header;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 87715f20b69a..7c3c4ebaded6 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -141,10 +141,13 @@ enum acpi_address_range_id {
 
 
 /* Table Handlers */
+union acpi_subtable_headers {
+	struct acpi_subtable_header common;
+};
 
 typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
 
-typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
+typedef int (*acpi_tbl_entry_handler)(union acpi_subtable_headers *header,
 				      const unsigned long end);
 
 /* Debugger support */
-- 
2.14.4



* [PATCHv4 02/13] acpi: Add HMAT to generic parsing tables
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneous memory node attributes Keith Busch
  2019-01-16 17:57 ` [PATCHv4 01/13] acpi: Create subtable parsing infrastructure Keith Busch
@ 2019-01-16 17:57 ` Keith Busch
  2019-01-16 17:57 ` [PATCHv4 03/13] acpi/hmat: Parse and report heterogeneous memory Keith Busch
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

The Heterogeneous Memory Attribute Table (HMAT) subtable header has
different field lengths than the generic header the existing parsing
assumes. Add the HMAT type to the parsing rules so it may be generically
parsed.
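
With this in place, HMAT subtables can be walked with the same
entry-parsing helper used for MADT and SRAT. A minimal sketch (the
handler name is hypothetical; patch 3 adds the real caller):

  if (acpi_table_parse_entries(ACPI_SIG_HMAT,
                               sizeof(struct acpi_table_hmat),
                               ACPI_HMAT_TYPE_LOCALITY,
                               my_hmat_handler, 0) < 0)
          pr_err("HMAT: failed to parse locality subtables\n");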

Cc: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/tables.c | 9 +++++++++
 include/linux/acpi.h  | 1 +
 2 files changed, 10 insertions(+)

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 967e1168becf..d9911cd55edc 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -51,6 +51,7 @@ static int acpi_apic_instance __initdata;
 
 enum acpi_subtable_type {
 	ACPI_SUBTABLE_COMMON,
+	ACPI_SUBTABLE_HMAT,
 };
 
 struct acpi_subtable_entry {
@@ -232,6 +233,8 @@ acpi_get_entry_type(struct acpi_subtable_entry *entry)
 	switch (entry->type) {
 	case ACPI_SUBTABLE_COMMON:
 		return entry->hdr->common.type;
+	case ACPI_SUBTABLE_HMAT:
+		return entry->hdr->hmat.type;
 	}
 	return 0;
 }
@@ -242,6 +245,8 @@ acpi_get_entry_length(struct acpi_subtable_entry *entry)
 	switch (entry->type) {
 	case ACPI_SUBTABLE_COMMON:
 		return entry->hdr->common.length;
+	case ACPI_SUBTABLE_HMAT:
+		return entry->hdr->hmat.length;
 	}
 	return 0;
 }
@@ -252,6 +257,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
 	switch (entry->type) {
 	case ACPI_SUBTABLE_COMMON:
 		return sizeof(entry->hdr->common);
+	case ACPI_SUBTABLE_HMAT:
+		return sizeof(entry->hdr->hmat);
 	}
 	return 0;
 }
@@ -259,6 +266,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
 static enum acpi_subtable_type __init
 acpi_get_subtable_type(char *id)
 {
+	if (strncmp(id, ACPI_SIG_HMAT, 4) == 0)
+		return ACPI_SUBTABLE_HMAT;
 	return ACPI_SUBTABLE_COMMON;
 }
 
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 7c3c4ebaded6..53f93dff171c 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -143,6 +143,7 @@ enum acpi_address_range_id {
 /* Table Handlers */
 union acpi_subtable_headers {
 	struct acpi_subtable_header common;
+	struct acpi_hmat_structure hmat;
 };
 
 typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
-- 
2.14.4



* [PATCHv4 03/13] acpi/hmat: Parse and report heterogeneous memory
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneous memory node attributes Keith Busch
  2019-01-16 17:57 ` [PATCHv4 01/13] acpi: Create subtable parsing infrastructure Keith Busch
  2019-01-16 17:57 ` [PATCHv4 02/13] acpi: Add HMAT to generic parsing tables Keith Busch
@ 2019-01-16 17:57 ` Keith Busch
  2019-01-17 11:00   ` Rafael J. Wysocki
  2019-01-16 17:57 ` [PATCHv4 04/13] node: Link memory nodes to their compute nodes Keith Busch
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Systems may provide different memory types and export this information
in the ACPI Heterogeneous Memory Attribute Table (HMAT). Parse these
tables provided by the platform and report the memory access and caching
attributes.
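
For illustration, on a hypothetical platform with one initiator domain
and one memory domain, the reporting added here would produce boot
messages along these lines (all values are made up; the formats follow
the pr_info() calls below):

  HMAT: Memory (0x100000000 length 0x80000000) Flags:0003 Processor Domain:0 Memory Domain:0
  HMAT: Locality: Flags:00 Type:Read Latency Initiator Domains:1 Target Domains:1 Base:100
    Initiator-Target[0-0]:80 nsec
  HMAT: Cache: Domain:0 Size:1073741824 Attrs:00081111 SMBIOS Handles:0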

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/Kconfig       |   1 +
 drivers/acpi/Makefile      |   1 +
 drivers/acpi/hmat/Kconfig  |   8 ++
 drivers/acpi/hmat/Makefile |   1 +
 drivers/acpi/hmat/hmat.c   | 180 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 191 insertions(+)
 create mode 100644 drivers/acpi/hmat/Kconfig
 create mode 100644 drivers/acpi/hmat/Makefile
 create mode 100644 drivers/acpi/hmat/hmat.c

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 90ff0a47c12e..b377f970adfd 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -465,6 +465,7 @@ config ACPI_REDUCED_HARDWARE_ONLY
 	  If you are unsure what to do, do not enable this option.
 
 source "drivers/acpi/nfit/Kconfig"
+source "drivers/acpi/hmat/Kconfig"
 
 source "drivers/acpi/apei/Kconfig"
 source "drivers/acpi/dptf/Kconfig"
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 7c6afc111d76..bff8fbe5a6ab 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_ACPI_PROCESSOR)	+= processor.o
 obj-$(CONFIG_ACPI)		+= container.o
 obj-$(CONFIG_ACPI_THERMAL)	+= thermal.o
 obj-$(CONFIG_ACPI_NFIT)		+= nfit/
+obj-$(CONFIG_ACPI_HMAT)		+= hmat/
 obj-$(CONFIG_ACPI)		+= acpi_memhotplug.o
 obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
 obj-$(CONFIG_ACPI_BATTERY)	+= battery.o
diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
new file mode 100644
index 000000000000..a4034d37a311
--- /dev/null
+++ b/drivers/acpi/hmat/Kconfig
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+config ACPI_HMAT
+	bool "ACPI Heterogeneous Memory Attribute Table Support"
+	depends on ACPI_NUMA
+	help
+	 Parses the representation of the ACPI Heterogeneous Memory Attributes
+	 Table (HMAT) and sets the memory node relationships and access
+	 attributes.
diff --git a/drivers/acpi/hmat/Makefile b/drivers/acpi/hmat/Makefile
new file mode 100644
index 000000000000..e909051d3d00
--- /dev/null
+++ b/drivers/acpi/hmat/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_ACPI_HMAT) := hmat.o
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
new file mode 100644
index 000000000000..833a783868d5
--- /dev/null
+++ b/drivers/acpi/hmat/hmat.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Heterogeneous Memory Attributes Table (HMAT) representation
+ *
+ * Copyright (c) 2018, Intel Corporation.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/bitops.h>
+#include <linux/cpu.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/node.h>
+#include <linux/slab.h>
+#include <linux/sysfs.h>
+
+static __init const char *hmat_data_type(u8 type)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+		return "Access Latency";
+	case ACPI_HMAT_READ_LATENCY:
+		return "Read Latency";
+	case ACPI_HMAT_WRITE_LATENCY:
+		return "Write Latency";
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+		return "Access Bandwidth";
+	case ACPI_HMAT_READ_BANDWIDTH:
+		return "Read Bandwidth";
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		return "Write Bandwidth";
+	default:
+		return "Reserved";
+	};
+}
+
+static __init const char *hmat_data_type_suffix(u8 type)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+	case ACPI_HMAT_READ_LATENCY:
+	case ACPI_HMAT_WRITE_LATENCY:
+		return " nsec";
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+	case ACPI_HMAT_READ_BANDWIDTH:
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		return " MB/s";
+	default:
+		return "";
+	};
+}
+
+static __init int hmat_parse_locality(union acpi_subtable_headers *header,
+				      const unsigned long end)
+{
+	struct acpi_hmat_locality *loc = (void *)header;
+	unsigned int init, targ, total_size, ipds, tpds;
+	u32 *inits, *targs, value;
+	u16 *entries;
+	u8 type;
+
+	if (loc->header.length < sizeof(*loc)) {
+		pr_err("HMAT: Unexpected locality header length: %d\n",
+			loc->header.length);
+		return -EINVAL;
+	}
+
+	type = loc->data_type;
+	ipds = loc->number_of_initiator_Pds;
+	tpds = loc->number_of_target_Pds;
+	total_size = sizeof(*loc) + sizeof(*entries) * ipds * tpds +
+		     sizeof(*inits) * ipds + sizeof(*targs) * tpds;
+	if (loc->header.length < total_size) {
+		pr_err("HMAT: Unexpected locality header length:%d, minimum required:%d\n",
+			loc->header.length, total_size);
+		return -EINVAL;
+	}
+
+	pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n",
+		loc->flags, hmat_data_type(type), ipds, tpds,
+		loc->entry_base_unit);
+
+	inits = (u32 *)(loc + 1);
+	targs = &inits[ipds];
+	entries = (u16 *)(&targs[tpds]);
+	for (targ = 0; targ < tpds; targ++) {
+		for (init = 0; init < ipds; init++) {
+			value = entries[init * tpds + targ];
+			value = (value * loc->entry_base_unit) / 10;
+			pr_info("  Initiator-Target[%d-%d]:%d%s\n",
+				inits[init], targs[targ], value,
+				hmat_data_type_suffix(type));
+		}
+	}
+	return 0;
+}
+
+static __init int hmat_parse_cache(union acpi_subtable_headers *header,
+				   const unsigned long end)
+{
+	struct acpi_hmat_cache *cache = (void *)header;
+	u32 attrs;
+
+	if (cache->header.length < sizeof(*cache)) {
+		pr_err("HMAT: Unexpected cache header length: %d\n",
+			cache->header.length);
+		return -EINVAL;
+	}
+
+	attrs = cache->cache_attributes;
+	pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n",
+		cache->memory_PD, cache->cache_size, attrs,
+		cache->number_of_SMBIOShandles);
+
+	return 0;
+}
+
+static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
+					   const unsigned long end)
+{
+	struct acpi_hmat_address_range *spa = (void *)header;
+
+	if (spa->header.length != sizeof(*spa)) {
+		pr_err("HMAT: Unexpected address range header length: %d\n",
+			spa->header.length);
+		return -EINVAL;
+	}
+	pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
+		spa->physical_address_base, spa->physical_address_length,
+		spa->flags, spa->processor_PD, spa->memory_PD);
+	return 0;
+}
+
+static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
+				      const unsigned long end)
+{
+	struct acpi_hmat_structure *hdr = (void *)header;
+
+	if (!hdr)
+		return -EINVAL;
+
+	switch (hdr->type) {
+	case ACPI_HMAT_TYPE_ADDRESS_RANGE:
+		return hmat_parse_address_range(header, end);
+	case ACPI_HMAT_TYPE_LOCALITY:
+		return hmat_parse_locality(header, end);
+	case ACPI_HMAT_TYPE_CACHE:
+		return hmat_parse_cache(header, end);
+	default:
+		return -EINVAL;
+	}
+}
+
+static __init int hmat_init(void)
+{
+	struct acpi_table_header *tbl;
+	enum acpi_hmat_type i;
+	acpi_status status;
+
+	if (srat_disabled())
+		return 0;
+
+	status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
+	if (ACPI_FAILURE(status))
+		return 0;
+
+	for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) {
+		if (acpi_table_parse_entries(ACPI_SIG_HMAT,
+					     sizeof(struct acpi_table_hmat), i,
+					     hmat_parse_subtable, 0) < 0)
+			goto out_put;
+	}
+out_put:
+	acpi_put_table(tbl);
+	return 0;
+}
+subsys_initcall(hmat_init);
-- 
2.14.4



* [PATCHv4 04/13] node: Link memory nodes to their compute nodes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneous memory node attributes Keith Busch
                   ` (2 preceding siblings ...)
  2019-01-16 17:57 ` [PATCHv4 03/13] acpi/hmat: Parse and report heterogeneous memory Keith Busch
@ 2019-01-16 17:57 ` Keith Busch
  2019-01-17 11:26   ` Rafael J. Wysocki
  2019-01-16 17:57 ` [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes Keith Busch
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Systems may be constructed with various specialized nodes. Some nodes
may provide memory, some provide compute devices that access and use
that memory, and others may provide both. Nodes that provide memory are
referred to as memory targets, and nodes that can initiate memory access
are referred to as memory initiators.

Memory targets will often have varying access characteristics from
different initiators, and platforms may have ways to express those
relationships. In preparation for these systems, provide interfaces
for the kernel to export the memory relationships among different nodes:
memory targets and their initiators are linked with symlinks to each
other's nodes, and node lists showing the same relationship are
exported.

If a system provides access locality for each initiator-target pair, nodes
may be grouped into ranked access classes relative to other nodes. The new
interface allows a subsystem to register relationships of varying classes
if available and desired to be exported. A lower class number indicates
a higher performing tier, with 0 being the best performing class.

A memory initiator may have multiple memory targets in the same access
class. The initiator's memory targets in a given class indicate that the
node's access characteristics perform better relative to other initiator
nodes that are either unreported or in lower performing classes. The
targets within an initiator's class, though, do not necessarily perform
the same as each other.

A memory target node may have multiple memory initiators. All linked
initiators in a target's class have the same access characteristics to
that target.

The following example shows the nodes' new sysfs hierarchy for a memory
target node 'Y' with class 0 access from initiator node 'X':

  # symlinks -v /sys/devices/system/node/nodeX/class0/
  relative: /sys/devices/system/node/nodeX/class0/targetY -> ../../nodeY

  # symlinks -v /sys/devices/system/node/nodeY/class0/
  relative: /sys/devices/system/node/nodeY/class0/initiatorX -> ../../nodeX

And the same information is reflected in the nodelist:

  # cat /sys/devices/system/node/nodeX/class0/target_nodelist
  Y

  # cat /sys/devices/system/node/nodeY/class0/initiator_nodelist
  X
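
A minimal sketch of how a subsystem would register such a relationship
with the new interface (the node numbers are hypothetical; patch 6 adds
the real HMAT caller):

  int ret;

  /* link memory target node 1 to initiator node 0 in access class 0 */
  ret = register_memory_node_under_compute_node(1, 0, 0);
  if (ret)
          pr_err("node linkage failed: %d\n", ret);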

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/node.c  | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/node.h |   6 ++-
 2 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd92ce3d..1da5072116ab 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -17,6 +17,7 @@
 #include <linux/nodemask.h>
 #include <linux/cpu.h>
 #include <linux/device.h>
+#include <linux/pm_runtime.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
 
@@ -59,6 +60,91 @@ static inline ssize_t node_read_cpulist(struct device *dev,
 static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
+struct node_class_nodes {
+	struct device		dev;
+	struct list_head	list_node;
+	unsigned		class;
+	nodemask_t		initiator_nodes;
+	nodemask_t		target_nodes;
+};
+#define to_class_nodes(dev) container_of(dev, struct node_class_nodes, dev)
+
+static ssize_t initiator_nodelist_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct node_class_nodes *c = to_class_nodes(dev);
+	return scnprintf(buf, PAGE_SIZE - 1, "%*pbl\n",
+			 nodemask_pr_args(&c->initiator_nodes));
+}
+static DEVICE_ATTR_RO(initiator_nodelist);
+
+static ssize_t target_nodelist_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct node_class_nodes *c = to_class_nodes(dev);
+	return scnprintf(buf, PAGE_SIZE - 1, "%*pbl\n",
+			 nodemask_pr_args(&c->target_nodes));
+}
+static DEVICE_ATTR_RO(target_nodelist);
+
+static struct attribute *node_class_node_attrs[] = {
+	&dev_attr_initiator_nodelist.attr,
+	&dev_attr_target_nodelist.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(node_class_node);
+
+static void node_remove_classes(struct node *node)
+{
+	struct node_class_nodes *c, *cnext;
+
+	list_for_each_entry_safe(c, cnext, &node->class_list, list_node) {
+		list_del(&c->list_node);
+		device_unregister(&c->dev);
+	}
+}
+
+static void node_class_release(struct device *dev)
+{
+	kfree(to_class_nodes(dev));
+}
+
+static struct node_class_nodes *node_init_node_class(struct device *parent,
+						     struct list_head *list,
+						     unsigned class)
+{
+	struct node_class_nodes *class_node;
+	struct device *dev;
+
+	list_for_each_entry(class_node, list, list_node)
+		if (class_node->class == class)
+			return class_node;
+
+	class_node = kzalloc(sizeof(*class_node), GFP_KERNEL);
+	if (!class_node)
+		return NULL;
+
+	class_node->class = class;
+	dev = &class_node->dev;
+	dev->parent = parent;
+	dev->release = node_class_release;
+	dev->groups = node_class_node_groups;
+	if (dev_set_name(dev, "class%u", class))
+		goto free;
+
+	if (device_register(dev))
+		goto free_name;
+
+	pm_runtime_no_callbacks(dev);
+	list_add_tail(&class_node->list_node, list);
+	return class_node;
+free_name:
+	kfree_const(dev->kobj.name);
+free:
+	kfree(class_node);
+	return NULL;
+}
+
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
@@ -340,7 +426,7 @@ static int register_node(struct node *node, int num)
 void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
-
+	node_remove_classes(node);
 	device_unregister(&node->dev);
 }
 
@@ -372,6 +458,44 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid)
 				 kobject_name(&node_devices[nid]->dev.kobj));
 }
 
+int register_memory_node_under_compute_node(unsigned int m, unsigned int p,
+					    unsigned class)
+{
+	struct node *init, *targ;
+	struct node_class_nodes *i, *t;
+	char initiator[20]; /* "initiator4294967295\0" */
+	char target[17];    /* "target4294967295\0" */
+	int ret;
+
+	if (!node_online(p) || !node_online(m))
+		return -ENODEV;
+
+	init = node_devices[p];
+	targ = node_devices[m];
+	i = node_init_node_class(&init->dev, &init->class_list, class);
+	t = node_init_node_class(&targ->dev, &targ->class_list, class);
+	if (!i || !t)
+		return -ENOMEM;
+
+	snprintf(initiator, sizeof(initiator), "initiator%u", p);
+	snprintf(target, sizeof(target), "target%u", m);
+	ret = sysfs_create_link(&i->dev.kobj, &targ->dev.kobj, target);
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_link(&t->dev.kobj, &init->dev.kobj, initiator);
+	if (ret)
+		goto err;
+
+	node_set(m, i->target_nodes);
+	node_set(p, t->initiator_nodes);
+	return 0;
+ err:
+	sysfs_remove_link(&node_devices[p]->dev.kobj,
+			  kobject_name(&node_devices[m]->dev.kobj));
+	return ret;
+}
+
 int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 {
 	struct device *obj;
@@ -580,6 +704,7 @@ int __register_one_node(int nid)
 			register_cpu_under_node(cpu, nid);
 	}
 
+	INIT_LIST_HEAD(&node_devices[nid]->class_list);
 	/* initialize work queue for memory hot plug */
 	init_node_hugetlb_work(nid);
 
diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..8e3666c12ef2 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -17,11 +17,12 @@
 
 #include <linux/device.h>
 #include <linux/cpumask.h>
+#include <linux/list.h>
 #include <linux/workqueue.h>
 
 struct node {
 	struct device	dev;
-
+	struct list_head class_list;
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
 	struct work_struct	node_work;
 #endif
@@ -75,6 +76,9 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 					   unsigned long phys_index);
 
+extern int register_memory_node_under_compute_node(unsigned int m, unsigned int p,
+						   unsigned class);
+
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
 					 node_registration_func_t unregister);
-- 
2.14.4



* [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneous memory node attributes Keith Busch
                   ` (3 preceding siblings ...)
  2019-01-16 17:57 ` [PATCHv4 04/13] node: Link memory nodes to their compute nodes Keith Busch
@ 2019-01-16 17:57 ` Keith Busch
  2019-01-17 11:41   ` Rafael J. Wysocki
  2019-01-18 11:21   ` Jonathan Cameron
  2019-01-16 17:57 ` [PATCHv4 06/13] acpi/hmat: Register processor domain to its memory Keith Busch
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Add entries for memory initiator and target node class attributes.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 3e90e1f3bf0a..a9c47b4b0eee 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -90,4 +90,27 @@ Date:		December 2009
 Contact:	Lee Schermerhorn <lee.schermerhorn@hp.com>
 Description:
 		The node's huge page size control/query attributes.
-		See Documentation/admin-guide/mm/hugetlbpage.rst
\ No newline at end of file
+		See Documentation/admin-guide/mm/hugetlbpage.rst
+
+What:		/sys/devices/system/node/nodeX/classY/
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The node's relationship to other nodes for access class "Y".
+
+What:		/sys/devices/system/node/nodeX/classY/initiator_nodelist
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The node list of memory initiators that have class "Y" access
+		to this node's memory. CPUs and other memory initiators in
+		nodes not in the list accessing this node's memory may have
+		different performance.
+
+What:		/sys/devices/system/node/nodeX/classY/target_nodelist
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The node list of memory targets that this initiator node has
+		class "Y" access to. Memory accesses from this node to nodes not
+		in this list may have different performance.
-- 
2.14.4



* [PATCHv4 06/13] acpi/hmat: Register processor domain to its memory
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneous memory node attributes Keith Busch
                   ` (4 preceding siblings ...)
  2019-01-16 17:57 ` [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes Keith Busch
@ 2019-01-16 17:57 ` Keith Busch
  2019-01-17 12:11   ` Rafael J. Wysocki
  2019-01-16 17:57 ` [PATCHv4 07/13] node: Add heterogeneous memory access attributes Keith Busch
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

If the HMAT Subsystem Address Range provides a valid processor proximity
domain for a memory domain, or if a processor domain with the highest
performing access exists, register the memory target with that initiator
so this relationship will be visible under the node's sysfs directory.

Since HMAT requires that valid address ranges have an equivalent SRAT
entry, verify that each memory target satisfies this requirement.
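
For example, on a hypothetical system where processor domain 0 provides
the best performing access to memory domain 1, this registration makes
the relationship visible through the interfaces added in patch 4:

  # cat /sys/devices/system/node/node1/class0/initiator_nodelist
  0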

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/hmat/hmat.c | 143 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 136 insertions(+), 7 deletions(-)

diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index 833a783868d5..efb33c74d1a3 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -17,6 +17,43 @@
 #include <linux/slab.h>
 #include <linux/sysfs.h>
 
+static LIST_HEAD(targets);
+
+struct memory_target {
+	struct list_head node;
+	unsigned int memory_pxm;
+	unsigned long p_nodes[BITS_TO_LONGS(MAX_NUMNODES)];
+};
+
+static __init struct memory_target *find_mem_target(unsigned int m)
+{
+	struct memory_target *t;
+
+	list_for_each_entry(t, &targets, node)
+		if (t->memory_pxm == m)
+			return t;
+	return NULL;
+}
+
+static __init void alloc_memory_target(unsigned int mem_pxm)
+{
+	struct memory_target *t;
+
+	if (pxm_to_node(mem_pxm) == NUMA_NO_NODE)
+		return;
+
+	t = find_mem_target(mem_pxm);
+	if (t)
+		return;
+
+	t = kzalloc(sizeof(*t), GFP_KERNEL);
+	if (!t)
+		return;
+
+	t->memory_pxm = mem_pxm;
+	list_add_tail(&t->node, &targets);
+}
+
 static __init const char *hmat_data_type(u8 type)
 {
 	switch (type) {
@@ -53,11 +90,30 @@ static __init const char *hmat_data_type_suffix(u8 type)
 	};
 }
 
+static __init void hmat_update_access(u8 type, u32 value, u32 *best)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+	case ACPI_HMAT_READ_LATENCY:
+	case ACPI_HMAT_WRITE_LATENCY:
+		if (!*best || *best > value)
+			*best = value;
+		break;
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+	case ACPI_HMAT_READ_BANDWIDTH:
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		if (!*best || *best < value)
+			*best = value;
+		break;
+	}
+}
+
 static __init int hmat_parse_locality(union acpi_subtable_headers *header,
 				      const unsigned long end)
 {
+	struct memory_target *t;
 	struct acpi_hmat_locality *loc = (void *)header;
-	unsigned int init, targ, total_size, ipds, tpds;
+	unsigned int init, targ, pass, p_node, total_size, ipds, tpds;
 	u32 *inits, *targs, value;
 	u16 *entries;
 	u8 type;
@@ -87,12 +143,28 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
 	targs = &inits[ipds];
 	entries = (u16 *)(&targs[tpds]);
 	for (targ = 0; targ < tpds; targ++) {
-		for (init = 0; init < ipds; init++) {
-			value = entries[init * tpds + targ];
-			value = (value * loc->entry_base_unit) / 10;
-			pr_info("  Initiator-Target[%d-%d]:%d%s\n",
-				inits[init], targs[targ], value,
-				hmat_data_type_suffix(type));
+		u32 best = 0;
+
+		t = find_mem_target(targs[targ]);
+		for (pass = 0; pass < 2; pass++) {
+			for (init = 0; init < ipds; init++) {
+				value = entries[init * tpds + targ];
+				value = (value * loc->entry_base_unit) / 10;
+
+				if (!pass) {
+					hmat_update_access(type, value, &best);
+					pr_info("  Initiator-Target[%d-%d]:%d%s\n",
+						inits[init], targs[targ], value,
+						hmat_data_type_suffix(type));
+					continue;
+				}
+
+				if (!t)
+					continue;
+				p_node = pxm_to_node(inits[init]);
+				if (p_node != NUMA_NO_NODE && value == best)
+					set_bit(p_node, t->p_nodes);
+			}
 		}
 	}
 	return 0;
@@ -122,6 +194,7 @@ static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
 					   const unsigned long end)
 {
 	struct acpi_hmat_address_range *spa = (void *)header;
+	struct memory_target *t = NULL;
 
 	if (spa->header.length != sizeof(*spa)) {
 		pr_err("HMAT: Unexpected address range header length: %d\n",
@@ -131,6 +204,23 @@ static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
 	pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
 		spa->physical_address_base, spa->physical_address_length,
 		spa->flags, spa->processor_PD, spa->memory_PD);
+
+	if (spa->flags & ACPI_HMAT_MEMORY_PD_VALID) {
+		t = find_mem_target(spa->memory_PD);
+		if (!t) {
+			pr_warn("HMAT: Memory Domain missing from SRAT\n");
+			return -EINVAL;
+		}
+	}
+	if (t && spa->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
+		int p_node = pxm_to_node(spa->processor_PD);
+
+		if (p_node == NUMA_NO_NODE) {
+			pr_warn("HMAT: Invalid Processor Domain\n");
+			return -EINVAL;
+		}
+		set_bit(p_node, t->p_nodes);
+	}
 	return 0;
 }
 
@@ -154,6 +244,33 @@ static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
 	}
 }
 
+static __init int srat_parse_mem_affinity(union acpi_subtable_headers *header,
+					  const unsigned long end)
+{
+	struct acpi_srat_mem_affinity *ma = (void *)header;
+
+	if (!ma)
+		return -EINVAL;
+	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
+		return 0;
+	alloc_memory_target(ma->proximity_domain);
+	return 0;
+}
+
+static __init void hmat_register_targets(void)
+{
+	struct memory_target *t, *next;
+	unsigned m, p;
+
+	list_for_each_entry_safe(t, next, &targets, node) {
+		list_del(&t->node);
+		m = pxm_to_node(t->memory_pxm);
+		for_each_set_bit(p, t->p_nodes, MAX_NUMNODES)
+			register_memory_node_under_compute_node(m, p, 0);
+		kfree(t);
+	}
+}
+
 static __init int hmat_init(void)
 {
 	struct acpi_table_header *tbl;
@@ -163,6 +280,17 @@ static __init int hmat_init(void)
 	if (srat_disabled())
 		return 0;
 
+	status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
+	if (ACPI_FAILURE(status))
+		return 0;
+
+	if (acpi_table_parse_entries(ACPI_SIG_SRAT,
+				sizeof(struct acpi_table_srat),
+				ACPI_SRAT_TYPE_MEMORY_AFFINITY,
+				srat_parse_mem_affinity, 0) < 0)
+		goto out_put;
+	acpi_put_table(tbl);
+
 	status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
 	if (ACPI_FAILURE(status))
 		return 0;
@@ -173,6 +301,7 @@ static __init int hmat_init(void)
 					     hmat_parse_subtable, 0) < 0)
 			goto out_put;
 	}
+	hmat_register_targets();
 out_put:
 	acpi_put_table(tbl);
 	return 0;
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCHv4 07/13] node: Add heterogenous memory access attributes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneuos memory node attributes Keith Busch
                   ` (5 preceding siblings ...)
  2019-01-16 17:57 ` [PATCHv4 06/13] acpi/hmat: Register processor domain to its memory Keith Busch
@ 2019-01-16 17:57 ` Keith Busch
  2019-01-17 15:03   ` Rafael J. Wysocki
  2019-01-16 17:57 ` [PATCHv4 08/13] Documentation/ABI: Add node performance attributes Keith Busch
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Heterogeneous memory systems provide memory nodes with different latency
and bandwidth performance attributes. Provide a new kernel interface for
subsystems to register the attributes under the memory target node's
initiator access class. If the system provides this information, applications
may query these attributes when deciding which node to request memory from.

The following example shows the new sysfs hierarchy for a node exporting
performance attributes:

  # tree -P "read*|write*" /sys/devices/system/node/nodeY/classZ/
  /sys/devices/system/node/nodeY/classZ/
  |-- read_bandwidth
  |-- read_latency
  |-- write_bandwidth
  `-- write_latency

The bandwidth is exported as MB/s and latency is reported in nanoseconds.
Memory accesses from an initiator node that is not one of the memory's
class "Z" initiator nodes may encounter different performance than
reported here. When a subsystem makes use of this interface, initiators
of a lower class number, "Z", have better performance relative to higher
class numbers. When provided, class 0 is the highest performing access
class.
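
As an illustrative sketch (not part of this patch's diff), a subsystem
that has determined a memory node's performance might register it with
the new interface as follows, where the nid and the values are
hypothetical:

  struct node_hmem_attrs attrs = {
	.read_bandwidth		= 8192,	/* MB/s */
	.write_bandwidth	= 8192,	/* MB/s */
	.read_latency		= 80,	/* nanoseconds */
	.write_latency		= 90,	/* nanoseconds */
  };

  /* Registers the values under nodeY/class0/ for memory node nid */
  node_set_perf_attrs(nid, &attrs, 0);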

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/Kconfig |  8 ++++++++
 drivers/base/node.c  | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/node.h | 25 +++++++++++++++++++++++++
 3 files changed, 81 insertions(+)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 3e63a900b330..6014980238e8 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE
 	  unusable. You should say N here unless you are explicitly looking to
 	  test this functionality.
 
+config HMEM_REPORTING
+	bool
+	default y
+	depends on NUMA
+	help
+	  Enable reporting for heterogeneous memory access attributes under
+	  their non-uniform memory nodes.
+
 source "drivers/base/test/Kconfig"
 
 config SYS_HYPERVISOR
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1da5072116ab..1e909f61e8b1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -66,6 +66,9 @@ struct node_class_nodes {
 	unsigned		class;
 	nodemask_t		initiator_nodes;
 	nodemask_t		target_nodes;
+#ifdef CONFIG_HMEM_REPORTING
+	struct node_hmem_attrs	hmem_attrs;
+#endif
 };
 #define to_class_nodes(dev) container_of(dev, struct node_class_nodes, dev)
 
@@ -145,6 +148,51 @@ static struct node_class_nodes *node_init_node_class(struct device *parent,
 	return NULL;
 }
 
+#ifdef CONFIG_HMEM_REPORTING
+#define ACCESS_ATTR(name) 						   \
+static ssize_t name##_show(struct device *dev,				   \
+			   struct device_attribute *attr,		   \
+			   char *buf)					   \
+{									   \
+	return sprintf(buf, "%u\n", to_class_nodes(dev)->hmem_attrs.name); \
+}									   \
+static DEVICE_ATTR_RO(name);
+
+ACCESS_ATTR(read_bandwidth)
+ACCESS_ATTR(read_latency)
+ACCESS_ATTR(write_bandwidth)
+ACCESS_ATTR(write_latency)
+
+static struct attribute *access_attrs[] = {
+	&dev_attr_read_bandwidth.attr,
+	&dev_attr_read_latency.attr,
+	&dev_attr_write_bandwidth.attr,
+	&dev_attr_write_latency.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(access);
+
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
+			 unsigned class)
+{
+	struct node_class_nodes *c;
+	struct node *node;
+
+	if (WARN_ON_ONCE(!node_online(nid)))
+		return;
+
+	node = node_devices[nid];
+	c = node_init_node_class(&node->dev, &node->class_list, class);
+	if (!c)
+		return;
+
+	c->hmem_attrs = *hmem_attrs;
+	if (sysfs_create_groups(&c->dev.kobj, access_groups))
+		pr_info("failed to add performance attribute group to node %d\n",
+			nid);
+}
+#endif
+
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
diff --git a/include/linux/node.h b/include/linux/node.h
index 8e3666c12ef2..e22940a593c2 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -20,6 +20,31 @@
 #include <linux/list.h>
 #include <linux/workqueue.h>
 
+#ifdef CONFIG_HMEM_REPORTING
+/**
+ * struct node_hmem_attrs - heterogeneous memory performance attributes
+ *
+ * @read_bandwidth:	Read bandwidth in MB/s
+ * @write_bandwidth:	Write bandwidth in MB/s
+ * @read_latency:	Read latency in nanoseconds
+ * @write_latency:	Write latency in nanoseconds
+ */
+struct node_hmem_attrs {
+	unsigned int read_bandwidth;
+	unsigned int write_bandwidth;
+	unsigned int read_latency;
+	unsigned int write_latency;
+};
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
+			 unsigned class);
+#else
+static inline void node_set_perf_attrs(unsigned int nid,
+				       struct node_hmem_attrs *hmem_attrs,
+				       unsigned class)
+{
+}
+#endif
+
 struct node {
 	struct device	dev;
 	struct list_head class_list;
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCHv4 08/13] Documentation/ABI: Add node performance attributes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneuos memory node attributes Keith Busch
                   ` (6 preceding siblings ...)
  2019-01-16 17:57 ` [PATCHv4 07/13] node: Add heterogenous memory access attributes Keith Busch
@ 2019-01-16 17:57 ` Keith Busch
  2019-01-17 15:09   ` Rafael J. Wysocki
  2019-01-16 17:58 ` [PATCHv4 09/13] acpi/hmat: Register " Keith Busch
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Add descriptions for memory class initiator performance access attributes.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/ABI/stable/sysfs-devices-node | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index a9c47b4b0eee..2217557f29d3 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -114,3 +114,31 @@ Description:
 		The node list of memory targets that this initiator node has
 		class "Y" access. Memory accesses from this node to nodes not
 		in this list may have different performance.
+
+What:		/sys/devices/system/node/nodeX/classY/read_bandwidth
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This node's read bandwidth in MB/s available to memory
+		initiators in nodes found in this class's initiators_nodelist.
+
+What:		/sys/devices/system/node/nodeX/classY/read_latency
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This node's read latency in nanoseconds available to memory
+		initiators in nodes found in this class's initiators_nodelist.
+
+What:		/sys/devices/system/node/nodeX/classY/write_bandwidth
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This node's write bandwidth in MB/s available to memory
+		initiators in nodes found in this class's initiators_nodelist.
+
+What:		/sys/devices/system/node/nodeX/classY/write_latency
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This node's write latency in nanoseconds available to memory
+		initiators in nodes found in this class's initiators_nodelist.
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCHv4 09/13] acpi/hmat: Register performance attributes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneuos memory node attributes Keith Busch
                   ` (7 preceding siblings ...)
  2019-01-16 17:57 ` [PATCHv4 08/13] Documentation/ABI: Add node performance attributes Keith Busch
@ 2019-01-16 17:58 ` Keith Busch
  2019-01-17 15:21   ` Rafael J. Wysocki
  2019-01-16 17:58 ` [PATCHv4 10/13] node: Add memory caching attributes Keith Busch
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:58 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Save the best performance access attributes and register these with the
memory's node if HMAT provides the locality table. While HMAT does make
it possible to know performance for all possible initiator-target
pairings, we export only the best pairings at this time.
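
For illustration, if two initiator domains report 120 and 80 nanosecond
read latency to the same target, the helper added earlier in the series
keeps the better value, and only that value is registered:

  u32 best = 0;

  hmat_update_access(ACPI_HMAT_READ_LATENCY, 120, &best); /* best = 120 */
  hmat_update_access(ACPI_HMAT_READ_LATENCY, 80, &best);  /* best = 80 */
  hmat_update_target(t, ACPI_HMAT_READ_LATENCY, best);    /* read_latency = 80 */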

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/hmat/Kconfig |  1 +
 drivers/acpi/hmat/hmat.c  | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
index a4034d37a311..20a0e96ba58a 100644
--- a/drivers/acpi/hmat/Kconfig
+++ b/drivers/acpi/hmat/Kconfig
@@ -2,6 +2,7 @@
 config ACPI_HMAT
 	bool "ACPI Heterogeneous Memory Attribute Table Support"
 	depends on ACPI_NUMA
+	select HMEM_REPORTING
 	help
 	 Parses representation of the ACPI Heterogeneous Memory Attributes
 	 Table (HMAT) and set the memory node relationships and access
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index efb33c74d1a3..45e20dc677f9 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -23,6 +23,8 @@ struct memory_target {
 	struct list_head node;
 	unsigned int memory_pxm;
 	unsigned long p_nodes[BITS_TO_LONGS(MAX_NUMNODES)];
+	bool hmem_valid;
+	struct node_hmem_attrs hmem;
 };
 
 static __init struct memory_target *find_mem_target(unsigned int m)
@@ -108,6 +110,34 @@ static __init void hmat_update_access(u8 type, u32 value, u32 *best)
 	}
 }
 
+static __init void hmat_update_target(struct memory_target *t, u8 type,
+				      u32 value)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+		t->hmem.read_latency = value;
+		t->hmem.write_latency = value;
+		break;
+	case ACPI_HMAT_READ_LATENCY:
+		t->hmem.read_latency = value;
+		break;
+	case ACPI_HMAT_WRITE_LATENCY:
+		t->hmem.write_latency = value;
+		break;
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+		t->hmem.read_bandwidth = value;
+		t->hmem.write_bandwidth = value;
+		break;
+	case ACPI_HMAT_READ_BANDWIDTH:
+		t->hmem.read_bandwidth = value;
+		break;
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		t->hmem.write_bandwidth = value;
+		break;
+	}
+	t->hmem_valid = true;
+}
+
 static __init int hmat_parse_locality(union acpi_subtable_headers *header,
 				      const unsigned long end)
 {
@@ -166,6 +196,8 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
 					set_bit(p_node, t->p_nodes);
 			}
 		}
+		if (t && best)
+			hmat_update_target(t, type, best);
 	}
 	return 0;
 }
@@ -267,6 +299,8 @@ static __init void hmat_register_targets(void)
 		m = pxm_to_node(t->memory_pxm);
 		for_each_set_bit(p, t->p_nodes, MAX_NUMNODES)
 			register_memory_node_under_compute_node(m, p, 0);
+		if (t->hmem_valid)
+			node_set_perf_attrs(m, &t->hmem, 0);
 		kfree(t);
 	}
 }
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCHv4 10/13] node: Add memory caching attributes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneuos memory node attributes Keith Busch
                   ` (8 preceding siblings ...)
  2019-01-16 17:58 ` [PATCHv4 09/13] acpi/hmat: Register " Keith Busch
@ 2019-01-16 17:58 ` Keith Busch
  2019-01-17 16:00   ` Rafael J. Wysocki
  2019-02-09  8:20   ` Brice Goglin
  2019-01-16 17:58 ` [PATCHv4 11/13] Documentation/ABI: Add node cache attributes Keith Busch
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:58 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

System memory may have side caches to help improve access speed to
frequently requested address ranges. While the system provided cache is
transparent to the software accessing these memory ranges, applications
can optimize their own access based on cache attributes.

Provide a new API for the kernel to register these memory side caches
under the memory node that provides them.

The new sysfs representation is modeled from the existing cpu cacheinfo
attributes, as seen from /sys/devices/system/cpu/cpuX/cache/.
Unlike CPU cacheinfo, though, the node cache level is reported from
the view of the memory. A higher number is nearer to the CPU, while
lower levels are closer to the backing memory. Also unlike CPU cache,
it is assumed the system will handle flushing any dirty cached memory
to the last level on a power failure if the range is persistent memory.

The attributes we export are the cache size, the line size, associativity,
and write back policy.
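
A minimal sketch of a caller of the new API, with made-up values, may
look like the following:

  struct node_cache_attrs cache_attrs = {
	.size		= 8ULL << 30,	/* 8 GiB memory side cache */
	.line_size	= 64,
	.level		= 1,
	.associativity	= NODE_CACHE_DIRECT_MAP,
	.write_policy	= NODE_CACHE_WRITE_BACK,
  };

  /* Creates /sys/devices/system/node/nodeX/side_cache/index1/ */
  node_add_cache(nid, &cache_attrs);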

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/node.c  | 142 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/node.h |  39 ++++++++++++++
 2 files changed, 181 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1e909f61e8b1..7ff3ed566d7d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -191,6 +191,146 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
 		pr_info("failed to add performance attribute group to node %d\n",
 			nid);
 }
+
+struct node_cache_info {
+	struct device dev;
+	struct list_head node;
+	struct node_cache_attrs cache_attrs;
+};
+#define to_cache_info(device) container_of(device, struct node_cache_info, dev)
+
+#define CACHE_ATTR(name, fmt) 						\
+static ssize_t name##_show(struct device *dev,				\
+			   struct device_attribute *attr,		\
+			   char *buf)					\
+{									\
+	return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\
+}									\
+DEVICE_ATTR_RO(name);
+
+CACHE_ATTR(size, "%llu")
+CACHE_ATTR(level, "%u")
+CACHE_ATTR(line_size, "%u")
+CACHE_ATTR(associativity, "%u")
+CACHE_ATTR(write_policy, "%u")
+
+static struct attribute *cache_attrs[] = {
+	&dev_attr_level.attr,
+	&dev_attr_associativity.attr,
+	&dev_attr_size.attr,
+	&dev_attr_line_size.attr,
+	&dev_attr_write_policy.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(cache);
+
+static void node_cache_release(struct device *dev)
+{
+	kfree(dev);
+}
+
+static void node_cacheinfo_release(struct device *dev)
+{
+	struct node_cache_info *info = to_cache_info(dev);
+	kfree(info);
+}
+
+static void node_init_cache_dev(struct node *node)
+{
+	struct device *dev;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return;
+
+	dev->parent = &node->dev;
+	dev->release = node_cache_release;
+	if (dev_set_name(dev, "side_cache"))
+		goto free_dev;
+
+	if (device_register(dev))
+		goto free_name;
+
+	pm_runtime_no_callbacks(dev);
+	node->cache_dev = dev;
+	return;
+free_name:
+	kfree_const(dev->kobj.name);
+free_dev:
+	kfree(dev);
+}
+
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
+{
+	struct node_cache_info *info;
+	struct device *dev;
+	struct node *node;
+
+	if (!node_online(nid) || !node_devices[nid])
+		return;
+
+	node = node_devices[nid];
+	list_for_each_entry(info, &node->cache_attrs, node) {
+		if (info->cache_attrs.level == cache_attrs->level) {
+			dev_warn(&node->dev,
+				"attempt to add duplicate cache level:%d\n",
+				cache_attrs->level);
+			return;
+		}
+	}
+
+	if (!node->cache_dev)
+		node_init_cache_dev(node);
+	if (!node->cache_dev)
+		return;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return;
+
+	dev = &info->dev;
+	dev->parent = node->cache_dev;
+	dev->release = node_cacheinfo_release;
+	dev->groups = cache_groups;
+	if (dev_set_name(dev, "index%d", cache_attrs->level))
+		goto free_cache;
+
+	info->cache_attrs = *cache_attrs;
+	if (device_register(dev)) {
+		dev_warn(&node->dev, "failed to add cache level:%d\n",
+			 cache_attrs->level);
+		goto free_name;
+	}
+	pm_runtime_no_callbacks(dev);
+	list_add_tail(&info->node, &node->cache_attrs);
+	return;
+free_name:
+	kfree_const(dev->kobj.name);
+free_cache:
+	kfree(info);
+}
+
+static void node_remove_caches(struct node *node)
+{
+	struct node_cache_info *info, *next;
+
+	if (!node->cache_dev)
+		return;
+
+	list_for_each_entry_safe(info, next, &node->cache_attrs, node) {
+		list_del(&info->node);
+		device_unregister(&info->dev);
+	}
+	device_unregister(node->cache_dev);
+}
+
+static void node_init_caches(unsigned int nid)
+{
+	INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
+}
+#else
+static void node_init_caches(unsigned int nid) { }
+static void node_remove_caches(struct node *node) { }
 #endif
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
@@ -475,6 +615,7 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
 	node_remove_classes(node);
+	node_remove_caches(node);
 	device_unregister(&node->dev);
 }
 
@@ -755,6 +896,7 @@ int __register_one_node(int nid)
 	INIT_LIST_HEAD(&node_devices[nid]->class_list);
 	/* initialize work queue for memory hot plug */
 	init_node_hugetlb_work(nid);
+	node_init_caches(nid);
 
 	return error;
 }
diff --git a/include/linux/node.h b/include/linux/node.h
index e22940a593c2..8cdf2b2808e4 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -37,12 +37,47 @@ struct node_hmem_attrs {
 };
 void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
 			 unsigned class);
+
+enum cache_associativity {
+	NODE_CACHE_DIRECT_MAP,
+	NODE_CACHE_INDEXED,
+	NODE_CACHE_OTHER,
+};
+
+enum cache_write_policy {
+	NODE_CACHE_WRITE_BACK,
+	NODE_CACHE_WRITE_THROUGH,
+	NODE_CACHE_WRITE_OTHER,
+};
+
+/**
+ * struct node_cache_attrs - system memory caching attributes
+ *
+ * @associativity:	The ways memory blocks may be placed in cache
+ * @write_policy:	Write back or write through policy
+ * @size:		Total size of cache in bytes
+ * @line_size:		Number of bytes fetched on a cache miss
+ * @level:		Represents the cache hierarchy level
+ */
+struct node_cache_attrs {
+	enum cache_associativity associativity;
+	enum cache_write_policy write_policy;
+	u64 size;
+	u16 line_size;
+	u8  level;
+};
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
 #else
 static inline void node_set_perf_attrs(unsigned int nid,
 				       struct node_hmem_attrs *hmem_attrs,
 				       unsigned class)
 {
 }
+
+static inline void node_add_cache(unsigned int nid,
+				  struct node_cache_attrs *cache_attrs)
+{
+}
 #endif
 
 struct node {
@@ -51,6 +86,10 @@ struct node {
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
 	struct work_struct	node_work;
 #endif
+#ifdef CONFIG_HMEM_REPORTING
+	struct list_head cache_attrs;
+	struct device *cache_dev;
+#endif
 };
 
 struct memory_block;
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCHv4 11/13] Documentation/ABI: Add node cache attributes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneuos memory node attributes Keith Busch
                   ` (9 preceding siblings ...)
  2019-01-16 17:58 ` [PATCHv4 10/13] node: Add memory caching attributes Keith Busch
@ 2019-01-16 17:58 ` Keith Busch
  2019-01-17 16:25   ` Rafael J. Wysocki
  2019-01-16 17:58 ` [PATCHv4 12/13] acpi/hmat: Register memory side " Keith Busch
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:58 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Add the attributes for the system memory side caches.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/ABI/stable/sysfs-devices-node | 34 +++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 2217557f29d3..613d51fb52a3 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -142,3 +142,37 @@ Contact:	Keith Busch <keith.busch@intel.com>
 Description:
 		This node's write latency in nanoseconds available to memory
 		initiators in nodes found in this class's initiators_nodelist.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/associativity
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The cache's associativity: 0 for direct mapped, non-zero if
+		indexed.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/level
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		This cache's level in the memory hierarchy. Matches 'Y' in the
+		directory name.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/line_size
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The number of bytes accessed from the next cache level on a
+		cache miss.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/size
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The size of this memory side cache in bytes.
+
+What:		/sys/devices/system/node/nodeX/side_cache/indexY/write_policy
+Date:		December 2018
+Contact:	Keith Busch <keith.busch@intel.com>
+Description:
+		The cache write policy: 0 for write-back, 1 for write-through,
+		2 for other or unknown.
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCHv4 12/13] acpi/hmat: Register memory side cache attributes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneuos memory node attributes Keith Busch
                   ` (10 preceding siblings ...)
  2019-01-16 17:58 ` [PATCHv4 11/13] Documentation/ABI: Add node cache attributes Keith Busch
@ 2019-01-16 17:58 ` Keith Busch
  2019-01-17 17:42   ` Rafael J. Wysocki
  2019-01-16 17:58 ` [PATCHv4 13/13] doc/mm: New documentation for memory performance Keith Busch
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:58 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Register memory side cache attributes with the memory's node if HMAT
provides the side cache information table.
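
As a worked example, assuming the ACPI 6.2 field encodings this patch
relies on, a cache_attributes value of 0x00402211 would decode to cache
level 1, complex cache indexing (reported as NODE_CACHE_INDEXED),
write-through policy (NODE_CACHE_WRITE_THROUGH), and a 64 byte cache
line size.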

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/hmat/hmat.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index 45e20dc677f9..9efdd0a63a79 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -206,6 +206,7 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
 				   const unsigned long end)
 {
 	struct acpi_hmat_cache *cache = (void *)header;
+	struct node_cache_attrs cache_attrs;
 	u32 attrs;
 
 	if (cache->header.length < sizeof(*cache)) {
@@ -219,6 +220,37 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
 		cache->memory_PD, cache->cache_size, attrs,
 		cache->number_of_SMBIOShandles);
 
+	cache_attrs.size = cache->cache_size;
+	cache_attrs.level = (attrs & ACPI_HMAT_CACHE_LEVEL) >> 4;
+	cache_attrs.line_size = (attrs & ACPI_HMAT_CACHE_LINE_SIZE) >> 16;
+
+	switch ((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) {
+	case ACPI_HMAT_CA_DIRECT_MAPPED:
+		cache_attrs.associativity = NODE_CACHE_DIRECT_MAP;
+		break;
+	case ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING:
+		cache_attrs.associativity = NODE_CACHE_INDEXED;
+		break;
+	case ACPI_HMAT_CA_NONE:
+	default:
+		cache_attrs.associativity = NODE_CACHE_OTHER;
+		break;
+	}
+
+	switch ((attrs & ACPI_HMAT_WRITE_POLICY) >> 12) {
+	case ACPI_HMAT_CP_WB:
+		cache_attrs.write_policy = NODE_CACHE_WRITE_BACK;
+		break;
+	case ACPI_HMAT_CP_WT:
+		cache_attrs.write_policy = NODE_CACHE_WRITE_THROUGH;
+		break;
+	case ACPI_HMAT_CP_NONE:
+	default:
+		cache_attrs.write_policy = NODE_CACHE_WRITE_OTHER;
+		break;
+	}
+
+	node_add_cache(pxm_to_node(cache->memory_PD), &cache_attrs);
 	return 0;
 }
 
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCHv4 13/13] doc/mm: New documentation for memory performance
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneuos memory node attributes Keith Busch
                   ` (11 preceding siblings ...)
  2019-01-16 17:58 ` [PATCHv4 12/13] acpi/hmat: Register memory side " Keith Busch
@ 2019-01-16 17:58 ` Keith Busch
  2019-01-17 12:58 ` [PATCHv4 00/13] Heterogeneuos memory node attributes Balbir Singh
  2019-01-17 18:18 ` Jonathan Cameron
  14 siblings, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-01-16 17:58 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Platforms may provide system memory where some physical address ranges
perform differently than others, or are side cached by the system.

Add documentation describing a high level overview of such systems and the
performance and caching attributes the kernel provides for applications
wishing to query this information.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/admin-guide/mm/numaperf.rst | 184 ++++++++++++++++++++++++++++++
 1 file changed, 184 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/numaperf.rst

diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
new file mode 100644
index 000000000000..963fbd3004d3
--- /dev/null
+++ b/Documentation/admin-guide/mm/numaperf.rst
@@ -0,0 +1,184 @@
+.. _numaperf:
+
+=============
+NUMA Locality
+=============
+
+Some platforms may have multiple types of memory attached to a single
+CPU. These disparate memory ranges share some characteristics, such as
+CPU cache coherence, but may have different performance. For example,
+different media types and buses affect bandwidth and latency.
+
+A system supports such heterogeneous memory by grouping each memory
+type under different "nodes" based on similar CPU locality and performance
+characteristics.  Some memory may share the same node as a CPU, and others
+are provided as memory only nodes. While memory only nodes do not provide
+CPUs, they may still be directly accessible, or local, to one or more
+compute nodes. The following diagram shows one such example of two compute
+nodes with local memory and a memory only node for each compute node:
+
+ +------------------+     +------------------+
+ | Compute Node 0   +-----+ Compute Node 1   |
+ | Local Node0 Mem  |     | Local Node1 Mem  |
+ +--------+---------+     +--------+---------+
+          |                        |
+ +--------+---------+     +--------+---------+
+ | Slower Node2 Mem |     | Slower Node3 Mem |
+ +------------------+     +------------------+
+
+A "memory initiator" is a node containing one or more devices such as
+CPUs or separate memory I/O devices that can initiate memory requests.
+A "memory target" is a node containing one or more physical address
+ranges accessible from one or more memory initiators.
+
+When multiple memory initiators exist, they may not all have the same
+performance when accessing a given memory target. Each initiator-target
+pair may be organized into different ranked access classes to represent
+this relationship. The highest performing initiator to a given target
+is considered to be one of that target's local initiators, and given
+the highest access class, 0. Any given target may have one or more
+local initiators, and any given initiator may have multiple local
+memory targets.
+
+To aid applications in matching memory targets with their initiators,
+the kernel provides symlinks between them. The following example lists
+the relationship for the class "0" memory initiators and targets, which
+is the set of nodes with the highest performing access relationship::
+
+	# symlinks -v /sys/devices/system/node/nodeX/class0/
+	relative: /sys/devices/system/node/nodeX/class0/targetY -> ../../nodeY
+
+	# symlinks -v /sys/devices/system/node/nodeY/class0/
+	relative: /sys/devices/system/node/nodeY/class0/initiatorX -> ../../nodeX
+
+The linked nodes will also have their node numbers set in the class's
+target and initiator nodelist entries. Continuing the example above,
+this may look like the following::
+
+	# cat /sys/devices/system/node/nodeX/class0/target_nodelist
+	Y
+
+	# cat /sys/devices/system/node/nodeY/class0/initiator_nodelist
+	X
+
+The following example shows how the existing 'numactl' tool may be used
+to run a task on the CPUs and memory in the best class nodes for a
+particular PCI device::
+
+  # NODE=$(cat /sys/devices/pci:0000:00/.../numa_node)
+  # numactl --membind=$(cat /sys/devices/system/node/node${NODE}/class0/target_nodelist) \
+      --cpunodebind=$(cat /sys/devices/system/node/node${NODE}/class0/initiator_nodelist) \
+      -- <some-program-to-execute>
+
+================
+NUMA Performance
+================
+
+Applications may wish to consider which node they want their memory to
+be allocated from based on the node's performance characteristics. If
+the system provides these attributes, the kernel exports them under the
+node sysfs hierarchy, placing the attributes under the memory node's
+class 0 access directory as follows::
+
+	/sys/devices/system/node/nodeY/class0/
+
+These attributes apply only to the memory initiator nodes that have the
+same class access; those nodes are symlinked under the class and set in
+the class's initiator nodelist.
+
+The performance characteristics the kernel provides for the local initiators
+are exported as follows::
+
+	# tree -P "read*|write*" /sys/devices/system/node/nodeY/class0/
+	/sys/devices/system/node/nodeY/class0/
+	|-- read_bandwidth
+	|-- read_latency
+	|-- write_bandwidth
+	`-- write_latency
+
+The bandwidth attributes are provided in MB/s.
+
+The latency attributes are provided in nanoseconds.
+
+==========
+NUMA Cache
+==========
+
+System memory may be constructed in a hierarchy of elements with various
+performance characteristics in order to provide a large address space
+of slower performing memory that is side-cached by smaller, higher
+performing memory. The system physical addresses that initiators are
+are provided by the last memory level in the hierarchy. The system
+meanwhile uses higher performing memory to transparently cache access
+to progressively slower levels.
+
+The term "far memory" is used to denote the last level memory in the
+hierarchy. Each increasing cache level provides higher performing
+initiator access, and the term "near memory" represents the fastest
+cache provided by the system.
+
+This numbering is different than CPU caches, where the cache level (ex:
+L1, L2, L3) uses a CPU centric view in which each increased level is
+lower performing. In contrast, the memory cache level is centric to the
+last level memory, so a higher numbered cache level denotes memory
+nearer to the CPU, and further from far memory.
+
+The memory side caches are not directly addressable by software. When
+software accesses a system address, the system will return it from the
+near memory cache if it is present. If it is not present, the system
+accesses the next level of memory until there is either a hit in that
+cache level, or it reaches far memory.
+
+An application does not need to know about caching attributes in order
+to use the system. Software may optionally query the memory cache
+attributes in order to maximize the performance out of such a setup.
+If the system provides a way for the kernel to discover this information,
+for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
+the kernel will append these attributes to the NUMA node memory target.
+
+When the kernel first registers a memory cache with a node, the kernel
+will create the following directory::
+
+	/sys/devices/system/node/nodeX/side_cache/
+
+If that directory is not present, the system either does not provide
+a memory side cache, or that information is not accessible to the kernel.
+
+The attributes for each level of cache are provided under its cache
+level index::
+
+	/sys/devices/system/node/nodeX/side_cache/indexA/
+	/sys/devices/system/node/nodeX/side_cache/indexB/
+	/sys/devices/system/node/nodeX/side_cache/indexC/
+
+Each cache level's directory provides its attributes. For example, the
+following shows a single cache level and the attributes available for
+software to query::
+
+	# tree /sys/devices/system/node/node0/side_cache/
+	/sys/devices/system/node/node0/side_cache/
+	|-- index1
+	|   |-- associativity
+	|   |-- level
+	|   |-- line_size
+	|   |-- size
+	|   `-- write_policy
+
+The "associativity" will be 0 if it is a direct-mapped cache, and non-zero
+for any other index-based, multi-way associativity.
+
+The "level" is the distance from the far memory, and matches the number
+appended to its "index" directory.
+
+The "line_size" is the number of bytes accessed on a cache miss.
+
+The "size" is the number of bytes provided by this cache level.
+
+The "write_policy" will be 0 for write-back, and non-zero for
+write-through caching.
+
+========
+See Also
+========
+.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
+       Section 5.2.27
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 03/13] acpi/hmat: Parse and report heterogeneous memory
  2019-01-16 17:57 ` [PATCHv4 03/13] acpi/hmat: Parse and report heterogeneous memory Keith Busch
@ 2019-01-17 11:00   ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 11:00 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Mailing List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Systems may provide different memory types and export this information
> in the ACPI Heterogeneous Memory Attribute Table (HMAT). Parse these
> tables provided by the platform and report the memory access and caching
> attributes.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/acpi/Kconfig       |   1 +
>  drivers/acpi/Makefile      |   1 +
>  drivers/acpi/hmat/Kconfig  |   8 ++
>  drivers/acpi/hmat/Makefile |   1 +
>  drivers/acpi/hmat/hmat.c   | 180 +++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 191 insertions(+)
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
>
> diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> index 90ff0a47c12e..b377f970adfd 100644
> --- a/drivers/acpi/Kconfig
> +++ b/drivers/acpi/Kconfig
> @@ -465,6 +465,7 @@ config ACPI_REDUCED_HARDWARE_ONLY
>           If you are unsure what to do, do not enable this option.
>
>  source "drivers/acpi/nfit/Kconfig"
> +source "drivers/acpi/hmat/Kconfig"
>
>  source "drivers/acpi/apei/Kconfig"
>  source "drivers/acpi/dptf/Kconfig"
> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> index 7c6afc111d76..bff8fbe5a6ab 100644
> --- a/drivers/acpi/Makefile
> +++ b/drivers/acpi/Makefile
> @@ -79,6 +79,7 @@ obj-$(CONFIG_ACPI_PROCESSOR)  += processor.o
>  obj-$(CONFIG_ACPI)             += container.o
>  obj-$(CONFIG_ACPI_THERMAL)     += thermal.o
>  obj-$(CONFIG_ACPI_NFIT)                += nfit/
> +obj-$(CONFIG_ACPI_HMAT)                += hmat/

Yes, I prefer it to go into a separate directory.

Who do you want to maintain it, me or Dan?

>  obj-$(CONFIG_ACPI)             += acpi_memhotplug.o
>  obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
>  obj-$(CONFIG_ACPI_BATTERY)     += battery.o
> diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
> new file mode 100644
> index 000000000000..a4034d37a311
> --- /dev/null
> +++ b/drivers/acpi/hmat/Kconfig
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: GPL-2.0
> +config ACPI_HMAT
> +       bool "ACPI Heterogeneous Memory Attribute Table Support"
> +       depends on ACPI_NUMA
> +       help
> +        Parses representation of the ACPI Heterogeneous Memory Attributes
> +        Table (HMAT) and set the memory node relationships and access
> +        attributes.

What about:

"If set, this option causes the kernel to set the memory NUMA node
relationships and access attributes in accordance with ACPI HMAT
(Heterogeneous Memory Attributes Table)."

> diff --git a/drivers/acpi/hmat/Makefile b/drivers/acpi/hmat/Makefile
> new file mode 100644
> index 000000000000..e909051d3d00
> --- /dev/null
> +++ b/drivers/acpi/hmat/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_ACPI_HMAT) := hmat.o
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> new file mode 100644
> index 000000000000..833a783868d5
> --- /dev/null
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -0,0 +1,180 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Heterogeneous Memory Attributes Table (HMAT) representation
> + *
> + * Copyright (c) 2018, Intel Corporation.

Can you put a comment describing the code somewhat in here?

> + */
> +
> +#include <acpi/acpi_numa.h>
> +#include <linux/acpi.h>
> +#include <linux/bitops.h>
> +#include <linux/cpu.h>
> +#include <linux/device.h>
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/node.h>
> +#include <linux/slab.h>
> +#include <linux/sysfs.h>

Are all of the headers above really necessary to build the code?

> +
> +static __init const char *hmat_data_type(u8 type)
> +{
> +       switch (type) {
> +       case ACPI_HMAT_ACCESS_LATENCY:
> +               return "Access Latency";
> +       case ACPI_HMAT_READ_LATENCY:
> +               return "Read Latency";
> +       case ACPI_HMAT_WRITE_LATENCY:
> +               return "Write Latency";
> +       case ACPI_HMAT_ACCESS_BANDWIDTH:
> +               return "Access Bandwidth";
> +       case ACPI_HMAT_READ_BANDWIDTH:
> +               return "Read Bandwidth";
> +       case ACPI_HMAT_WRITE_BANDWIDTH:
> +               return "Write Bandwidth";
> +       default:
> +               return "Reserved";
> +       };
> +}
> +
> +static __init const char *hmat_data_type_suffix(u8 type)
> +{
> +       switch (type) {
> +       case ACPI_HMAT_ACCESS_LATENCY:
> +       case ACPI_HMAT_READ_LATENCY:
> +       case ACPI_HMAT_WRITE_LATENCY:
> +               return " nsec";
> +       case ACPI_HMAT_ACCESS_BANDWIDTH:
> +       case ACPI_HMAT_READ_BANDWIDTH:
> +       case ACPI_HMAT_WRITE_BANDWIDTH:
> +               return " MB/s";
> +       default:
> +               return "";
> +       };
> +}
> +
> +static __init int hmat_parse_locality(union acpi_subtable_headers *header,
> +                                     const unsigned long end)
> +{
> +       struct acpi_hmat_locality *loc = (void *)header;
> +       unsigned int init, targ, total_size, ipds, tpds;
> +       u32 *inits, *targs, value;
> +       u16 *entries;
> +       u8 type;
> +
> +       if (loc->header.length < sizeof(*loc)) {
> +               pr_err("HMAT: Unexpected locality header length: %d\n",
> +                       loc->header.length);

Why pr_err()?  Is the error really high-prio?

Same below.

> +               return -EINVAL;
> +       }
> +
> +       type = loc->data_type;
> +       ipds = loc->number_of_initiator_Pds;
> +       tpds = loc->number_of_target_Pds;
> +       total_size = sizeof(*loc) + sizeof(*entries) * ipds * tpds +
> +                    sizeof(*inits) * ipds + sizeof(*targs) * tpds;
> +       if (loc->header.length < total_size) {
> +               pr_err("HMAT: Unexpected locality header length:%d, minimum required:%d\n",
> +                       loc->header.length, total_size);
> +               return -EINVAL;
> +       }
> +
> +       pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n",
> +               loc->flags, hmat_data_type(type), ipds, tpds,
> +               loc->entry_base_unit);
> +
> +       inits = (u32 *)(loc + 1);
> +       targs = &inits[ipds];
> +       entries = (u16 *)(&targs[tpds]);
> +       for (targ = 0; targ < tpds; targ++) {
> +               for (init = 0; init < ipds; init++) {
> +                       value = entries[init * tpds + targ];
> +                       value = (value * loc->entry_base_unit) / 10;
> +                       pr_info("  Initiator-Target[%d-%d]:%d%s\n",
> +                               inits[init], targs[targ], value,
> +                               hmat_data_type_suffix(type));
> +               }
> +       }
> +       return 0;
> +}

The format and meaning of what is printed into the log should be
documented somewhere IMO.

Of course, that applies to the functions below as well.
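
For instance, with a made-up two-initiator read latency table, the
output would look something like:

  HMAT: Locality: Flags:00 Type:Read Latency Initiator Domains:2 Target Domains:1 Base:100
    Initiator-Target[0-2]:80 nsec
    Initiator-Target[1-2]:120 nsec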

> +
> +static __init int hmat_parse_cache(union acpi_subtable_headers *header,
> +                                  const unsigned long end)
> +{
> +       struct acpi_hmat_cache *cache = (void *)header;
> +       u32 attrs;
> +
> +       if (cache->header.length < sizeof(*cache)) {
> +               pr_err("HMAT: Unexpected cache header length: %d\n",
> +                       cache->header.length);
> +               return -EINVAL;
> +       }
> +
> +       attrs = cache->cache_attributes;
> +       pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n",
> +               cache->memory_PD, cache->cache_size, attrs,
> +               cache->number_of_SMBIOShandles);
> +
> +       return 0;
> +}
> +
> +static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
> +                                          const unsigned long end)
> +{
> +       struct acpi_hmat_address_range *spa = (void *)header;
> +
> +       if (spa->header.length != sizeof(*spa)) {
> +               pr_err("HMAT: Unexpected address range header length: %d\n",
> +                       spa->header.length);
> +               return -EINVAL;
> +       }
> +       pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
> +               spa->physical_address_base, spa->physical_address_length,
> +               spa->flags, spa->processor_PD, spa->memory_PD);
> +       return 0;
> +}
> +
> +static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
> +                                     const unsigned long end)
> +{
> +       struct acpi_hmat_structure *hdr = (void *)header;
> +
> +       if (!hdr)
> +               return -EINVAL;
> +
> +       switch (hdr->type) {
> +       case ACPI_HMAT_TYPE_ADDRESS_RANGE:
> +               return hmat_parse_address_range(header, end);
> +       case ACPI_HMAT_TYPE_LOCALITY:
> +               return hmat_parse_locality(header, end);
> +       case ACPI_HMAT_TYPE_CACHE:
> +               return hmat_parse_cache(header, end);
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
> +static __init int hmat_init(void)
> +{
> +       struct acpi_table_header *tbl;
> +       enum acpi_hmat_type i;
> +       acpi_status status;
> +
> +       if (srat_disabled())
> +               return 0;
> +
> +       status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
> +       if (ACPI_FAILURE(status))
> +               return 0;
> +
> +       for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) {
> +               if (acpi_table_parse_entries(ACPI_SIG_HMAT,
> +                                            sizeof(struct acpi_table_hmat), i,
> +                                            hmat_parse_subtable, 0) < 0)
> +                       goto out_put;
> +       }
> +out_put:
> +       acpi_put_table(tbl);
> +       return 0;
> +}
> +subsys_initcall(hmat_init);
> --

It looks like this particular patch only causes some extra messages to
be printed into the log, with no attribute setting etc. yet.

I would like the changelog to mention that.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 04/13] node: Link memory nodes to their compute nodes
  2019-01-16 17:57 ` [PATCHv4 04/13] node: Link memory nodes to their compute nodes Keith Busch
@ 2019-01-17 11:26   ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 11:26 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Mailing List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Systems may be constructed with various specialized nodes. Some nodes
> may provide memory, some provide compute devices that access and use
> that memory, and others may provide both. Nodes that provide memory are
> referred to as memory targets, and nodes that can initiate memory access
> are referred to as memory initiators.
>
> Memory targets will often have varying access characteristics from
> different initiators, and platforms may have ways to express those
> relationships. In preparation for these systems, provide interfaces
> for the kernel to export the memory relationship among different nodes
> memory targets and their initiators with symlinks to each other's nodes,
> and export node lists showing the same relationship.
>
> If a system provides access locality for each initiator-target pair, nodes
> may be grouped into ranked access classes relative to other nodes. The new
> interface allows a subsystem to register relationships of varying classes
> if available and desired to be exported. A lower class number indicates
> a higher performing tier, with 0 being the best performing class.
>
> A memory initiator may have multiple memory targets in the same access
> class. The initiator's memory targets in a given class indicate the node's
> access characteristics perform better relative to other initiator nodes
> either unreported or in lower class numbers. The targets within an
> initiator's class, though, do not necessarily perform the same as each
> other.
>
> A memory target node may have multiple memory initiators. All linked
> initiators in a target's class have the same access characteristics to
> that target.
>
> The following example shows the nodes' new sysfs hierarchy for a memory
> target node 'Y' with class 0 access from initiator node 'X':
>
>   # symlinks -v /sys/devices/system/node/nodeX/class0/
>   relative: /sys/devices/system/node/nodeX/class0/targetY -> ../../nodeY

If you added one more directory level and had "targets" and
"initiators" under "class0", the names of the symlinks could be the
same as the names of the nodes themselves, that is

/sys/devices/system/node/nodeX/class0/targets/nodeY -> ../../../nodeY

and the whole "nodelist" part wouldn't be necessary any more.

Also, it looks like "class0" is just a name at this point, but it will
represent an access class going forward.  Maybe it would be better to
use the word "access" in the directory name to indicate that (so there
would be "access0" instead of "class0").

>
>   # symlinks -v /sys/devices/system/node/nodeY/class0/
>   relative: /sys/devices/system/node/nodeY/class0/initiatorX -> ../../nodeX
>
> And the same information is reflected in the nodelist:
>
>   # cat /sys/devices/system/node/nodeX/class0/target_nodelist
>   Y
>
>   # cat /sys/devices/system/node/nodeY/class0/initiator_nodelist
>   X
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/base/node.c  | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/node.h |   6 ++-
>  2 files changed, 131 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 86d6cd92ce3d..1da5072116ab 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -17,6 +17,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/cpu.h>
>  #include <linux/device.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/swap.h>
>  #include <linux/slab.h>
>
> @@ -59,6 +60,91 @@ static inline ssize_t node_read_cpulist(struct device *dev,
>  static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>

A kerneldoc describing the struct type, please.

> +struct node_class_nodes {
> +       struct device           dev;
> +       struct list_head        list_node;
> +       unsigned                class;
> +       nodemask_t              initiator_nodes;
> +       nodemask_t              target_nodes;
> +};
> +#define to_class_nodes(dev) container_of(dev, struct node_class_nodes, dev)
> +

Generally speaking, your code is devoid of comments.

At a minimum, there should be kerneldoc comments for non-static
functions to explain what they are for.
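
For the structure above, for example, a kerneldoc along these lines
(field descriptions guessed from the code) would already help:

/**
 * struct node_class_nodes - ranked access class relations for a node
 * @dev:             device for this access class
 * @list_node:       entry in the owning node's class_list
 * @class:           access class number, 0 being the best performing
 * @initiator_nodes: nodes that can initiate access to this node's memory
 * @target_nodes:    memory target nodes accessible from this node
 */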

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-16 17:57 ` [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes Keith Busch
@ 2019-01-17 11:41   ` Rafael J. Wysocki
  2019-01-18 20:42     ` Keith Busch
  2019-01-18 21:08     ` Dan Williams
  2019-01-18 11:21   ` Jonathan Cameron
  1 sibling, 2 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 11:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Mailing List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Add entries for memory initiator and target node class attributes.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>

I would recommend combining this with the previous patch, as the way
it is now I need to look at two patches at a time. :-)

> ---
>  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 3e90e1f3bf0a..a9c47b4b0eee 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -90,4 +90,27 @@ Date:                December 2009
>  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
>  Description:
>                 The node's huge page size control/query attributes.
> -               See Documentation/admin-guide/mm/hugetlbpage.rst
> \ No newline at end of file
> +               See Documentation/admin-guide/mm/hugetlbpage.rst
> +
> +What:          /sys/devices/system/node/nodeX/classY/
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The node's relationship to other nodes for access class "Y".
> +
> +What:          /sys/devices/system/node/nodeX/classY/initiator_nodelist
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The node list of memory initiators that have class "Y" access
> +               to this node's memory. CPUs and other memory initiators in
> +               nodes not in the list accessing this node's memory may have
> +               different performance.

This does not follow the general "one value per file" rule of sysfs (I
know that there are other sysfs files with more than one value in
them, but it is better to follow this rule as long as that makes
sense).

> +
> +What:          /sys/devices/system/node/nodeX/classY/target_nodelist
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The node list of memory targets that this initiator node has
> +               class "Y" access. Memory accesses from this node to nodes not
> +               in this list may have different performance.
> --

Same here.

And if you follow the recommendation given in the previous message
(add "initiators" and "targets" subdirs under "classX"), you won't
even need the two files above.

And, of course, the symlinks part needs to be documented as well.  I
guess you can follow the
Documentation/ABI/testing/sysfs-devices-power_resources_D0 with that.
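
That is, something along these lines (a rough sketch, wording to be
adjusted):

What:		/sys/devices/system/node/nodeX/classY/initiatorZ
Date:		December 2018
Contact:	Keith Busch <keith.busch@intel.com>
Description:
		Symlink to memory initiator node Z, which has class "Y"
		access to this node's memory.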

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 06/13] acpi/hmat: Register processor domain to its memory
  2019-01-16 17:57 ` [PATCHv4 06/13] acpi/hmat: Register processor domain to its memory Keith Busch
@ 2019-01-17 12:11   ` Rafael J. Wysocki
  2019-01-17 17:01     ` Dan Williams
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 12:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Mailing List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> If the HMAT Subsystem Address Range provides a valid processor proximity
> domain for a memory domain, or a processor domain with the highest
> performing access exists, register the memory target with that initiator
> so this relationship will be visible under the node's sysfs directory.
>
> Since HMAT requires valid address ranges have an equivalent SRAT entry,
> verify each memory target satisfies this requirement.

What exactly will happen after this patch?

There will be some new directories under
/sys/devices/system/node/nodeX/ if all goes well.  Anything else?

> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/acpi/hmat/hmat.c | 143 ++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 136 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> index 833a783868d5..efb33c74d1a3 100644
> --- a/drivers/acpi/hmat/hmat.c
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -17,6 +17,43 @@
>  #include <linux/slab.h>
>  #include <linux/sysfs.h>
>
> +static LIST_HEAD(targets);
> +

A kerneldoc documenting the struct type here, please.

> +struct memory_target {
> +       struct list_head node;
> +       unsigned int memory_pxm;
> +       unsigned long p_nodes[BITS_TO_LONGS(MAX_NUMNODES)];
> +};
> +
> +static __init struct memory_target *find_mem_target(unsigned int m)

Why don't you call the arg mem_pxm like below?

> +{
> +       struct memory_target *t;
> +
> +       list_for_each_entry(t, &targets, node)
> +               if (t->memory_pxm == m)
> +                       return t;
> +       return NULL;
> +}
> +
> +static __init void alloc_memory_target(unsigned int mem_pxm)
> +{
> +       struct memory_target *t;
> +
> +       if (pxm_to_node(mem_pxm) == NUMA_NO_NODE)
> +               return;
> +
> +       t = find_mem_target(mem_pxm);
> +       if (t)
> +               return;
> +
> +       t = kzalloc(sizeof(*t), GFP_KERNEL);
> +       if (!t)
> +               return;
> +
> +       t->memory_pxm = mem_pxm;
> +       list_add_tail(&t->node, &targets);
> +}
> +
>  static __init const char *hmat_data_type(u8 type)
>  {
>         switch (type) {
> @@ -53,11 +90,30 @@ static __init const char *hmat_data_type_suffix(u8 type)
>         };
>  }
>
> +static __init void hmat_update_access(u8 type, u32 value, u32 *best)

I guess that you pass a pointer to avoid unnecessary updates, right?

But that causes you to dereference that pointer quite often.  It might
be better to pass the current value of 'best' and return an updated
one (which may be the same as the passed one, of course).
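
E.g. (rough sketch, untested):

static __init u32 hmat_update_access(u8 type, u32 value, u32 best)
{
	switch (type) {
	case ACPI_HMAT_ACCESS_LATENCY:
	case ACPI_HMAT_READ_LATENCY:
	case ACPI_HMAT_WRITE_LATENCY:
		/* lower latency is better */
		if (!best || best > value)
			return value;
		break;
	case ACPI_HMAT_ACCESS_BANDWIDTH:
	case ACPI_HMAT_READ_BANDWIDTH:
	case ACPI_HMAT_WRITE_BANDWIDTH:
		/* higher bandwidth is better */
		if (!best || best < value)
			return value;
		break;
	}
	return best;
}

and the caller would then do "best = hmat_update_access(type, value, best);".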

> +{
> +       switch (type) {
> +       case ACPI_HMAT_ACCESS_LATENCY:
> +       case ACPI_HMAT_READ_LATENCY:
> +       case ACPI_HMAT_WRITE_LATENCY:
> +               if (!*best || *best > value)
> +                       *best = value;
> +               break;
> +       case ACPI_HMAT_ACCESS_BANDWIDTH:
> +       case ACPI_HMAT_READ_BANDWIDTH:
> +       case ACPI_HMAT_WRITE_BANDWIDTH:
> +               if (!*best || *best < value)
> +                       *best = value;
> +               break;
> +       }
> +}
> +
>  static __init int hmat_parse_locality(union acpi_subtable_headers *header,
>                                       const unsigned long end)
>  {
> +       struct memory_target *t;

I would call this variable mem_target.  't' is too easy to overlook
IMO.  [Same below]

>         struct acpi_hmat_locality *loc = (void *)header;
> -       unsigned int init, targ, total_size, ipds, tpds;
> +       unsigned int init, targ, pass, p_node, total_size, ipds, tpds;
>         u32 *inits, *targs, value;
>         u16 *entries;
>         u8 type;
> @@ -87,12 +143,28 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
>         targs = &inits[ipds];
>         entries = (u16 *)(&targs[tpds]);
>         for (targ = 0; targ < tpds; targ++) {
> -               for (init = 0; init < ipds; init++) {
> -                       value = entries[init * tpds + targ];
> -                       value = (value * loc->entry_base_unit) / 10;
> -                       pr_info("  Initiator-Target[%d-%d]:%d%s\n",
> -                               inits[init], targs[targ], value,
> -                               hmat_data_type_suffix(type));
> +               u32 best = 0;
> +
> +               t = find_mem_target(targs[targ]);
> +               for (pass = 0; pass < 2; pass++) {
> +                       for (init = 0; init < ipds; init++) {
> +                               value = entries[init * tpds + targ];
> +                               value = (value * loc->entry_base_unit) / 10;
> +
> +                               if (!pass) {
> +                                       hmat_update_access(type, value, &best);
> +                                       pr_info("  Initiator-Target[%d-%d]:%d%s\n",
> +                                               inits[init], targs[targ], value,
> +                                               hmat_data_type_suffix(type));
> +                                       continue;
> +                               }
> +
> +                               if (!t)
> +                                       continue;
> +                               p_node = pxm_to_node(inits[init]);
> +                               if (p_node != NUMA_NO_NODE && value == best)
> +                                       set_bit(p_node, t->p_nodes);
> +                       }
>                 }
>         }
>         return 0;
> @@ -122,6 +194,7 @@ static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
>                                            const unsigned long end)
>  {
>         struct acpi_hmat_address_range *spa = (void *)header;
> +       struct memory_target *t = NULL;
>
>         if (spa->header.length != sizeof(*spa)) {
>                 pr_err("HMAT: Unexpected address range header length: %d\n",
> @@ -131,6 +204,23 @@ static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
>         pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
>                 spa->physical_address_base, spa->physical_address_length,
>                 spa->flags, spa->processor_PD, spa->memory_PD);
> +
> +       if (spa->flags & ACPI_HMAT_MEMORY_PD_VALID) {
> +               t = find_mem_target(spa->memory_PD);
> +               if (!t) {
> +                       pr_warn("HMAT: Memory Domain missing from SRAT\n");

Again, I'm wondering about the log level here.  Is "warning" really adequate?

> +                       return -EINVAL;
> +               }
> +       }
> +       if (t && spa->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
> +               int p_node = pxm_to_node(spa->processor_PD);
> +
> +               if (p_node == NUMA_NO_NODE) {
> +                       pr_warn("HMAT: Invalid Processor Domain\n");

Same here.

> +                       return -EINVAL;
> +               }
> +               set_bit(p_node, t->p_nodes);
> +       }
>         return 0;
>  }
>
> @@ -154,6 +244,33 @@ static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
>         }
>  }
>
> +static __init int srat_parse_mem_affinity(union acpi_subtable_headers *header,
> +                                         const unsigned long end)
> +{
> +       struct acpi_srat_mem_affinity *ma = (void *)header;
> +
> +       if (!ma)
> +               return -EINVAL;
> +       if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
> +               return 0;
> +       alloc_memory_target(ma->proximity_domain);
> +       return 0;
> +}
> +
> +static __init void hmat_register_targets(void)
> +{
> +       struct memory_target *t, *next;
> +       unsigned m, p;
> +
> +       list_for_each_entry_safe(t, next, &targets, node) {
> +               list_del(&t->node);
> +               m = pxm_to_node(t->memory_pxm);
> +               for_each_set_bit(p, t->p_nodes, MAX_NUMNODES)
> +                       register_memory_node_under_compute_node(m, p, 0);
> +               kfree(t);
> +       }
> +}
> +
>  static __init int hmat_init(void)
>  {
>         struct acpi_table_header *tbl;
> @@ -163,6 +280,17 @@ static __init int hmat_init(void)
>         if (srat_disabled())
>                 return 0;
>
> +       status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
> +       if (ACPI_FAILURE(status))
> +               return 0;
> +
> +       if (acpi_table_parse_entries(ACPI_SIG_SRAT,
> +                               sizeof(struct acpi_table_srat),
> +                               ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> +                               srat_parse_mem_affinity, 0) < 0)

Can you do

ret = acpi_table_parse_entries(ACPI_SIG_SRAT, sizeof(struct acpi_table_srat),
                               ACPI_SRAT_TYPE_MEMORY_AFFINITY,
                               srat_parse_mem_affinity, 0);
if (ret < 0)
        goto out_put;

here instead?  The current one is barely readable.

Also please add a comment to explain what it means if this returns an error.
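
For example (the wording is just a suggestion):

	/*
	 * A failure here means the SRAT's memory affinity entries could
	 * not be parsed, so there is nothing to validate the HMAT's
	 * memory targets against.
	 */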

> +               goto out_put;
> +       acpi_put_table(tbl);
> +
>         status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
>         if (ACPI_FAILURE(status))
>                 return 0;
> @@ -173,6 +301,7 @@ static __init int hmat_init(void)
>                                              hmat_parse_subtable, 0) < 0)
>                         goto out_put;
>         }
> +       hmat_register_targets();
>  out_put:
>         acpi_put_table(tbl);
>         return 0;
> --

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 00/13] Heterogeneuos memory node attributes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneuos memory node attributes Keith Busch
                   ` (12 preceding siblings ...)
  2019-01-16 17:58 ` [PATCHv4 13/13] doc/mm: New documentation for memory performance Keith Busch
@ 2019-01-17 12:58 ` Balbir Singh
  2019-01-17 15:44   ` Keith Busch
  2019-01-17 18:18 ` Jonathan Cameron
  14 siblings, 1 reply; 51+ messages in thread
From: Balbir Singh @ 2019-01-17 12:58 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 10:57:51AM -0700, Keith Busch wrote:
> The series seems quite calm now. I've received some approvals of the
> proposal, and heard no objections on the new core interfaces.
> 
> Please let me know if there is anyone or group of people I should request
> and wait for a review. And if anyone reading this would like additional
> time as well before I post a potentially subsequent version, please let
> me know.
> 
> I also wanted to inquire on upstream strategy if/when all desired
> reviews are received. The series is spanning a few subsystems, so I'm
> not sure who's tree is the best candidate. I could see an argument for
> driver-core, acpi, or mm as possible paths. Please let me know if there's
> a more appropriate option or any other gating concerns.
> 
> == Changes from v3 ==
> 
>   I've fixed the documentation issues that have been raised for v3 
> 
>   Moved the hmat files according to Rafael's recommendation
> 
>   Added received Reviewed-by's
> 
> Otherwise this v4 is much the same as v3.
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and in use today that describe
> the more complex memory hierarchies that can be created.
> 

Could you please expand on this text -- how are these attributes
exposed/consumed by both the kernel and user space?

> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.

I presume these tools and libraries are numactl and mbind()?

Balbir Singh.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 07/13] node: Add heterogenous memory access attributes
  2019-01-16 17:57 ` [PATCHv4 07/13] node: Add heterogenous memory access attributes Keith Busch
@ 2019-01-17 15:03   ` Rafael J. Wysocki
  2019-01-17 15:41     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 15:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

 On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Heterogeneous memory systems provide memory nodes with different latency
> and bandwidth performance attributes. Provide a new kernel interface for
> subsystems to register the attributes under the memory target node's
> initiator access class. If the system provides this information, applications
> may query these attributes when deciding which node to request memory from.
>
> The following example shows the new sysfs hierarchy for a node exporting
> performance attributes:
>
>   # tree -P "read*|write*" /sys/devices/system/node/nodeY/classZ/
>   /sys/devices/system/node/nodeY/classZ/
>   |-- read_bandwidth
>   |-- read_latency
>   |-- write_bandwidth
>   `-- write_latency
>
> The bandwidth is exported as MB/s and latency is reported in nanoseconds.
> Memory accesses from an initiator node that is not one of the memory's
> class "Z" initiator nodes may encounter different performance than
> reported here. When a subsystem makes use of this interface, initiators
> of a lower class number, "Z", have better performance relative to higher
> class numbers. When provided, class 0 is the highest performing access
> class.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/base/Kconfig |  8 ++++++++
>  drivers/base/node.c  | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/node.h | 25 +++++++++++++++++++++++++
>  3 files changed, 81 insertions(+)
>
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 3e63a900b330..6014980238e8 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE
>           unusable. You should say N here unless you are explicitly looking to
>           test this functionality.
>
> +config HMEM_REPORTING
> +       bool
> +       default y
> +       depends on NUMA
> +       help
> +         Enable reporting for heterogenous memory access attributes under
> +         their non-uniform memory nodes.

Why would anyone ever want to say "no" to this?

Distros will set it anyway.

> +
>  source "drivers/base/test/Kconfig"
>
>  config SYS_HYPERVISOR
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 1da5072116ab..1e909f61e8b1 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -66,6 +66,9 @@ struct node_class_nodes {
>         unsigned                class;
>         nodemask_t              initiator_nodes;
>         nodemask_t              target_nodes;
> +#ifdef CONFIG_HMEM_REPORTING
> +       struct node_hmem_attrs  hmem_attrs;
> +#endif
>  };
>  #define to_class_nodes(dev) container_of(dev, struct node_class_nodes, dev)
>
> @@ -145,6 +148,51 @@ static struct node_class_nodes *node_init_node_class(struct device *parent,
>         return NULL;
>  }
>
> +#ifdef CONFIG_HMEM_REPORTING
> +#define ACCESS_ATTR(name)                                                 \
> +static ssize_t name##_show(struct device *dev,                            \
> +                          struct device_attribute *attr,                  \
> +                          char *buf)                                      \
> +{                                                                         \
> +       return sprintf(buf, "%u\n", to_class_nodes(dev)->hmem_attrs.name); \
> +}                                                                         \
> +static DEVICE_ATTR_RO(name);
> +
> +ACCESS_ATTR(read_bandwidth)
> +ACCESS_ATTR(read_latency)
> +ACCESS_ATTR(write_bandwidth)
> +ACCESS_ATTR(write_latency)
> +
> +static struct attribute *access_attrs[] = {
> +       &dev_attr_read_bandwidth.attr,
> +       &dev_attr_read_latency.attr,
> +       &dev_attr_write_bandwidth.attr,
> +       &dev_attr_write_latency.attr,
> +       NULL,
> +};
> +ATTRIBUTE_GROUPS(access);
> +

Kerneldoc?

And who is going to call this?
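
As for the kerneldoc, something like this maybe (just a sketch):

/**
 * node_set_perf_attrs - register performance access attributes for a node
 * @nid: the memory target node the attributes describe
 * @hmem_attrs: the latency and bandwidth values to report
 * @class: the initiator access class the attributes belong to
 */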

> +void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
> +                        unsigned class)
> +{
> +       struct node_class_nodes *c;
> +       struct node *node;
> +
> +       if (WARN_ON_ONCE(!node_online(nid)))
> +               return;
> +
> +       node = node_devices[nid];
> +       c = node_init_node_class(&node->dev, &node->class_list, class);
> +       if (!c)
> +               return;
> +
> +       c->hmem_attrs = *hmem_attrs;
> +       if (sysfs_create_groups(&c->dev.kobj, access_groups))
> +               pr_info("failed to add performance attribute group to node %d\n",
> +                       nid);
> +}
> +#endif
> +
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
>  static ssize_t node_read_meminfo(struct device *dev,
>                         struct device_attribute *attr, char *buf)
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 8e3666c12ef2..e22940a593c2 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -20,6 +20,31 @@
>  #include <linux/list.h>
>  #include <linux/workqueue.h>
>
> +#ifdef CONFIG_HMEM_REPORTING
> +/**
> + * struct node_hmem_attrs - heterogeneous memory performance attributes
> + *
> + * @read_bandwidth:    Read bandwidth in MB/s
> + * @write_bandwidth:   Write bandwidth in MB/s
> + * @read_latency:      Read latency in nanoseconds
> + * @write_latency:     Write latency in nanoseconds
> + */
> +struct node_hmem_attrs {
> +       unsigned int read_bandwidth;
> +       unsigned int write_bandwidth;
> +       unsigned int read_latency;
> +       unsigned int write_latency;
> +};
> +void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
> +                        unsigned class);
> +#else
> +static inline void node_set_perf_attrs(unsigned int nid,
> +                                      struct node_hmem_attrs *hmem_attrs,
> +                                      unsigned class)
> +{
> +}

Have you tried to compile this with CONFIG_HMEM_REPORTING unset?

> +#endif
> +
>  struct node {
>         struct device   dev;
>         struct list_head class_list;
> --

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 08/13] Documentation/ABI: Add node performance attributes
  2019-01-16 17:57 ` [PATCHv4 08/13] Documentation/ABI: Add node performance attributes Keith Busch
@ 2019-01-17 15:09   ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 15:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Add descriptions for memory class initiator performance access attributes.

Again, I would combine this with the previous patch.

> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/ABI/stable/sysfs-devices-node | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index a9c47b4b0eee..2217557f29d3 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -114,3 +114,31 @@ Description:
>                 The node list of memory targets that this initiator node has
>                 class "Y" access to. Memory accesses from this node to nodes
>                 not in this list may have different performance.
> +
> +What:          /sys/devices/system/node/nodeX/classY/read_bandwidth
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               This node's read bandwidth in MB/s available to memory
> +               initiators in nodes found in this class's initiators_nodelist.
> +
> +What:          /sys/devices/system/node/nodeX/classY/read_latency
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               This node's read latency in nanoseconds available to memory
> +               initiators in nodes found in this class's initiators_nodelist.

I'm not sure if the term "read latency" is sufficient here.  Is this
the latency between sending a request and getting a response or
between sending the request and when the data actually becomes
available?

Moreover, is it the worst-case latency or the average latency?

> +
> +What:          /sys/devices/system/node/nodeX/classY/write_bandwidth
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               This node's write bandwidth in MB/s available to memory
> +               initiators in nodes found in this class's initiators_nodelist.
> +
> +What:          /sys/devices/system/node/nodeX/classY/write_latency
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               This node's write latency in nanoseconds available to memory
> +               initiators in nodes found in this class's initiators_nodelist.
> --

Same questions as for the read latency apply here.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 09/13] acpi/hmat: Register performance attributes
  2019-01-16 17:58 ` [PATCHv4 09/13] acpi/hmat: Register " Keith Busch
@ 2019-01-17 15:21   ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 15:21 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Save the best performance access attributes and register these with the
> memory's node if HMAT provides the locality table. While HMAT does make
> it possible to know performance for all possible initiator-target
> pairings, we export only the best pairings at this time.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/acpi/hmat/Kconfig |  1 +
>  drivers/acpi/hmat/hmat.c  | 34 ++++++++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
>
> diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig
> index a4034d37a311..20a0e96ba58a 100644
> --- a/drivers/acpi/hmat/Kconfig
> +++ b/drivers/acpi/hmat/Kconfig
> @@ -2,6 +2,7 @@
>  config ACPI_HMAT
>         bool "ACPI Heterogeneous Memory Attribute Table Support"
>         depends on ACPI_NUMA
> +       select HMEM_REPORTING

If you want HMEM_REPORTING to be only set when ACPI_HMAT is set, then
don't make HMEM_REPORTING user-selectable.
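
That is, something like (sketch):

config HMEM_REPORTING
	bool
	depends on NUMA

and let users such as ACPI_HMAT pull it in with "select HMEM_REPORTING",
as this patch already does.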

>         help
>          Parses representation of the ACPI Heterogeneous Memory Attributes
>          Table (HMAT) and sets the memory node relationships and access
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> index efb33c74d1a3..45e20dc677f9 100644
> --- a/drivers/acpi/hmat/hmat.c
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -23,6 +23,8 @@ struct memory_target {
>         struct list_head node;
>         unsigned int memory_pxm;
>         unsigned long p_nodes[BITS_TO_LONGS(MAX_NUMNODES)];
> +       bool hmem_valid;
> +       struct node_hmem_attrs hmem;
>  };
>
>  static __init struct memory_target *find_mem_target(unsigned int m)
> @@ -108,6 +110,34 @@ static __init void hmat_update_access(u8 type, u32 value, u32 *best)
>         }
>  }
>
> +static __init void hmat_update_target(struct memory_target *t, u8 type,
> +                                     u32 value)
> +{
> +       switch (type) {
> +       case ACPI_HMAT_ACCESS_LATENCY:
> +               t->hmem.read_latency = value;
> +               t->hmem.write_latency = value;
> +               break;
> +       case ACPI_HMAT_READ_LATENCY:
> +               t->hmem.read_latency = value;
> +               break;
> +       case ACPI_HMAT_WRITE_LATENCY:
> +               t->hmem.write_latency = value;
> +               break;
> +       case ACPI_HMAT_ACCESS_BANDWIDTH:
> +               t->hmem.read_bandwidth = value;
> +               t->hmem.write_bandwidth = value;
> +               break;
> +       case ACPI_HMAT_READ_BANDWIDTH:
> +               t->hmem.read_bandwidth = value;
> +               break;
> +       case ACPI_HMAT_WRITE_BANDWIDTH:
> +               t->hmem.write_bandwidth = value;
> +               break;
> +       }
> +       t->hmem_valid = true;

What if 'type' is none of the above?  After all these values come from
the firmware and that need not be correct.

Do you still want to set hmem_valid in that case?
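
If not, a default case like this might do (sketch):

	default:
		/* unknown/reserved type from firmware: ignore it */
		return;

so that hmem_valid is only set for the types we recognize.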

> +}
> +
>  static __init int hmat_parse_locality(union acpi_subtable_headers *header,
>                                       const unsigned long end)
>  {
> @@ -166,6 +196,8 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
>                                         set_bit(p_node, t->p_nodes);
>                         }
>                 }
> +               if (t && best)
> +                       hmat_update_target(t, type, best);
>         }
>         return 0;
>  }
> @@ -267,6 +299,8 @@ static __init void hmat_register_targets(void)
>                 m = pxm_to_node(t->memory_pxm);
>                 for_each_set_bit(p, t->p_nodes, MAX_NUMNODES)
>                         register_memory_node_under_compute_node(m, p, 0);
> +               if (t->hmem_valid)
> +                       node_set_perf_attrs(m, &t->hmem, 0);
>                 kfree(t);
>         }
>  }
> --

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 07/13] node: Add heterogenous memory access attributes
  2019-01-17 15:03   ` Rafael J. Wysocki
@ 2019-01-17 15:41     ` Greg Kroah-Hartman
  0 siblings, 0 replies; 51+ messages in thread
From: Greg Kroah-Hartman @ 2019-01-17 15:41 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Keith Busch, Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Dave Hansen, Dan Williams

On Thu, Jan 17, 2019 at 04:03:42PM +0100, Rafael J. Wysocki wrote:
>  On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > Heterogeneous memory systems provide memory nodes with different latency
> > and bandwidth performance attributes. Provide a new kernel interface for
> > subsystems to register the attributes under the memory target node's
> > initiator access class. If the system provides this information, applications
> > may query these attributes when deciding which node to request memory from.
> >
> > The following example shows the new sysfs hierarchy for a node exporting
> > performance attributes:
> >
> >   # tree -P "read*|write*" /sys/devices/system/node/nodeY/classZ/
> >   /sys/devices/system/node/nodeY/classZ/
> >   |-- read_bandwidth
> >   |-- read_latency
> >   |-- write_bandwidth
> >   `-- write_latency
> >
> > The bandwidth is exported as MB/s and latency is reported in nanoseconds.
> > Memory accesses from an initiator node that is not one of the memory's
> > class "Z" initiator nodes may encounter different performance than
> > reported here. When a subsystem makes use of this interface, initiators
> > of a lower class number, "Z", have better performance relative to higher
> > class numbers. When provided, class 0 is the highest performing access
> > class.
> >
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  drivers/base/Kconfig |  8 ++++++++
> >  drivers/base/node.c  | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/node.h | 25 +++++++++++++++++++++++++
> >  3 files changed, 81 insertions(+)
> >
> > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> > index 3e63a900b330..6014980238e8 100644
> > --- a/drivers/base/Kconfig
> > +++ b/drivers/base/Kconfig
> > @@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE
> >           unusable. You should say N here unless you are explicitly looking to
> >           test this functionality.
> >
> > +config HMEM_REPORTING
> > +       bool
> > +       default y

default y is only for when the machine will not boot without it.  Please
never make a new option default y unless you really really have to have
it on all machines in the world.

Hint, not here.

greg k-h

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 00/13] Heterogeneuos memory node attributes
  2019-01-17 12:58 ` [PATCHv4 00/13] Heterogeneuos memory node attributes Balbir Singh
@ 2019-01-17 15:44   ` Keith Busch
  2019-01-18 13:16     ` Balbir Singh
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-17 15:44 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, Jan 17, 2019 at 11:58:21PM +1100, Balbir Singh wrote:
> On Wed, Jan 16, 2019 at 10:57:51AM -0700, Keith Busch wrote:
> > It had previously been difficult to describe these setups as memory
> > ranges were generally lumped into the NUMA node of the CPUs. New
> > platform attributes have been created and in use today that describe
> > the more complex memory hierarchies that can be created.
> > 
> 
> Could you please expand on this text -- how are these attributes
> exposed/consumed by both the kernel and user space?
> 
> > This series' objective is to provide the attributes from such systems
> > that are useful for applications to know about, and readily usable with
> > existing tools and libraries.
> 
> I presume these tools and libraries are numactl and mbind()?

Yes, and numactl is used in the examples provided in both changelogs and
documentation in this series. Do you want to see those in the cover
letter as well?
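
For example, something along these lines (the node numbers here are
hypothetical):

  # which nodes have the best class 0 access to node 1's memory?
  cat /sys/devices/system/node/node1/class0/initiator_nodelist
  # allocate from node 1 while running on one of those initiators
  numactl --cpunodebind=0 --membind=1 ./app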

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 10/13] node: Add memory caching attributes
  2019-01-16 17:58 ` [PATCHv4 10/13] node: Add memory caching attributes Keith Busch
@ 2019-01-17 16:00   ` Rafael J. Wysocki
  2019-02-09  8:20   ` Brice Goglin
  1 sibling, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 16:00 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> System memory may have side caches to help improve access speed to
> frequently requested address ranges. While the system provided cache is
> transparent to the software accessing these memory ranges, applications
> can optimize their own access based on cache attributes.
>
> Provide a new API for the kernel to register these memory side caches
> under the memory node that provides it.
>
> The new sysfs representation is modeled from the existing cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/cpuX/side_cache/.
> Unlike CPU cacheinfo, though, the node cache level is reported from
> the view of the memory. A higher number is nearer to the CPU, while
> lower levels are closer to the backing memory. Also unlike CPU cache,
> it is assumed the system will handle flushing any dirty cached memory
> to the last level on a power failure if the range is persistent memory.
>
> The attributes we export are the cache size, the line size, associativity,
> and write back policy.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/base/node.c  | 142 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/node.h |  39 ++++++++++++++
>  2 files changed, 181 insertions(+)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 1e909f61e8b1..7ff3ed566d7d 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -191,6 +191,146 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
>                 pr_info("failed to add performance attribute group to node %d\n",
>                         nid);
>  }
> +
> +struct node_cache_info {
> +       struct device dev;
> +       struct list_head node;
> +       struct node_cache_attrs cache_attrs;
> +};
> +#define to_cache_info(device) container_of(device, struct node_cache_info, dev)
> +
> +#define CACHE_ATTR(name, fmt)                                          \
> +static ssize_t name##_show(struct device *dev,                         \
> +                          struct device_attribute *attr,               \
> +                          char *buf)                                   \
> +{                                                                      \
> +       return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\
> +}                                                                      \
> +DEVICE_ATTR_RO(name);
> +
> +CACHE_ATTR(size, "%llu")
> +CACHE_ATTR(level, "%u")
> +CACHE_ATTR(line_size, "%u")
> +CACHE_ATTR(associativity, "%u")
> +CACHE_ATTR(write_policy, "%u")
> +
> +static struct attribute *cache_attrs[] = {
> +       &dev_attr_level.attr,
> +       &dev_attr_associativity.attr,
> +       &dev_attr_size.attr,
> +       &dev_attr_line_size.attr,
> +       &dev_attr_write_policy.attr,
> +       NULL,
> +};
> +ATTRIBUTE_GROUPS(cache);
> +
> +static void node_cache_release(struct device *dev)
> +{
> +       kfree(dev);
> +}
> +
> +static void node_cacheinfo_release(struct device *dev)
> +{
> +       struct node_cache_info *info = to_cache_info(dev);
> +       kfree(info);
> +}
> +
> +static void node_init_cache_dev(struct node *node)
> +{
> +       struct device *dev;
> +
> +       dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +       if (!dev)
> +               return;
> +
> +       dev->parent = &node->dev;
> +       dev->release = node_cache_release;
> +       if (dev_set_name(dev, "side_cache"))
> +               goto free_dev;
> +
> +       if (device_register(dev))
> +               goto free_name;
> +
> +       pm_runtime_no_callbacks(dev);
> +       node->cache_dev = dev;
> +       return;

I would add an empty line here.

> +free_name:
> +       kfree_const(dev->kobj.name);
> +free_dev:
> +       kfree(dev);
> +}
> +
> +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
> +{
> +       struct node_cache_info *info;
> +       struct device *dev;
> +       struct node *node;
> +
> +       if (!node_online(nid) || !node_devices[nid])
> +               return;
> +
> +       node = node_devices[nid];
> +       list_for_each_entry(info, &node->cache_attrs, node) {
> +               if (info->cache_attrs.level == cache_attrs->level) {
> +                       dev_warn(&node->dev,
> +                               "attempt to add duplicate cache level:%d\n",
> +                               cache_attrs->level);

I'd suggest using dev_dbg() for this and I'm not even sure if printing
the message is worth the effort.

Firmware will probably give you duplicates and users cannot do much
about fixing that anyway.

> +                       return;
> +               }
> +       }
> +
> +       if (!node->cache_dev)
> +               node_init_cache_dev(node);
> +       if (!node->cache_dev)
> +               return;
> +
> +       info = kzalloc(sizeof(*info), GFP_KERNEL);
> +       if (!info)
> +               return;
> +
> +       dev = &info->dev;
> +       dev->parent = node->cache_dev;
> +       dev->release = node_cacheinfo_release;
> +       dev->groups = cache_groups;
> +       if (dev_set_name(dev, "index%d", cache_attrs->level))
> +               goto free_cache;
> +
> +       info->cache_attrs = *cache_attrs;
> +       if (device_register(dev)) {
> +               dev_warn(&node->dev, "failed to add cache level:%d\n",
> +                        cache_attrs->level);
> +               goto free_name;
> +       }
> +       pm_runtime_no_callbacks(dev);
> +       list_add_tail(&info->node, &node->cache_attrs);
> +       return;

Again, I'd add an empty line here.

> +free_name:
> +       kfree_const(dev->kobj.name);
> +free_cache:
> +       kfree(info);
> +}
> +
> +static void node_remove_caches(struct node *node)
> +{
> +       struct node_cache_info *info, *next;
> +
> +       if (!node->cache_dev)
> +               return;
> +
> +       list_for_each_entry_safe(info, next, &node->cache_attrs, node) {
> +               list_del(&info->node);
> +               device_unregister(&info->dev);
> +       }
> +       device_unregister(node->cache_dev);
> +}
> +
> +static void node_init_caches(unsigned int nid)
> +{
> +       INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
> +}
> +#else
> +static void node_init_caches(unsigned int nid) { }
> +static void node_remove_caches(struct node *node) { }
>  #endif
>
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
> @@ -475,6 +615,7 @@ void unregister_node(struct node *node)
>  {
>         hugetlb_unregister_node(node);          /* no-op, if memoryless node */
>         node_remove_classes(node);
> +       node_remove_caches(node);
>         device_unregister(&node->dev);
>  }
>
> @@ -755,6 +896,7 @@ int __register_one_node(int nid)
>         INIT_LIST_HEAD(&node_devices[nid]->class_list);
>         /* initialize work queue for memory hot plug */
>         init_node_hugetlb_work(nid);
> +       node_init_caches(nid);
>
>         return error;
>  }
> diff --git a/include/linux/node.h b/include/linux/node.h
> index e22940a593c2..8cdf2b2808e4 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -37,12 +37,47 @@ struct node_hmem_attrs {
>  };
>  void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
>                          unsigned class);
> +
> +enum cache_associativity {
> +       NODE_CACHE_DIRECT_MAP,
> +       NODE_CACHE_INDEXED,
> +       NODE_CACHE_OTHER,
> +};
> +
> +enum cache_write_policy {
> +       NODE_CACHE_WRITE_BACK,
> +       NODE_CACHE_WRITE_THROUGH,
> +       NODE_CACHE_WRITE_OTHER,
> +};
> +
> +/**
> + * struct node_cache_attrs - system memory caching attributes
> + *
> + * @associativity:     The ways memory blocks may be placed in cache
> + * @write_policy:      Write back or write through policy
> + * @size:              Total size of cache in bytes
> + * @line_size:         Number of bytes fetched on a cache miss
> + * @level:             Represents the cache hierarchy level
> + */
> +struct node_cache_attrs {
> +       enum cache_associativity associativity;
> +       enum cache_write_policy write_policy;
> +       u64 size;
> +       u16 line_size;
> +       u8  level;
> +};
> +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
>  #else
>  static inline void node_set_perf_attrs(unsigned int nid,
>                                        struct node_hmem_attrs *hmem_attrs,
>                                        unsigned class)
>  {
>  }
> +
> +static inline void node_add_cache(unsigned int nid,
> +                                 struct node_cache_attrs *cache_attrs)
> +{
> +}

And does this really build with CONFIG_HMEM_REPORTING unset?

>  #endif
>
>  struct node {
> @@ -51,6 +86,10 @@ struct node {
>  #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
>         struct work_struct      node_work;
>  #endif
> +#ifdef CONFIG_HMEM_REPORTING
> +       struct list_head cache_attrs;
> +       struct device *cache_dev;
> +#endif
>  };
>
>  struct memory_block;
> --

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 11/13] Documentation/ABI: Add node cache attributes
  2019-01-16 17:58 ` [PATCHv4 11/13] Documentation/ABI: Add node cache attributes Keith Busch
@ 2019-01-17 16:25   ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 16:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Add the attributes for the system memory side caches.

I really would combine this with the previous one.

> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/ABI/stable/sysfs-devices-node | 34 +++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 2217557f29d3..613d51fb52a3 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -142,3 +142,37 @@ Contact:   Keith Busch <keith.busch@intel.com>
>  Description:
>                 This node's write latency in nanoseconds available to memory
>                 initiators in nodes found in this class's initiators_nodelist.
> +
> +What:          /sys/devices/system/node/nodeX/side_cache/indexY/associativity
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The cache's associativity: 0 for direct mapped, non-zero if
> +               indexed.
> +
> +What:          /sys/devices/system/node/nodeX/side_cache/indexY/level
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               This cache's level in the memory hierarchy. Matches 'Y' in the
> +               directory name.
> +
> +What:          /sys/devices/system/node/nodeX/side_cache/indexY/line_size
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The number of bytes accessed from the next cache level on a
> +               cache miss.
> +
> +What:          /sys/devices/system/node/nodeX/side_cache/indexY/size
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The size of this memory side cache in bytes.
> +
> +What:          /sys/devices/system/node/nodeX/side_cache/indexY/write_policy
> +Date:          December 2018
> +Contact:       Keith Busch <keith.busch@intel.com>
> +Description:
> +               The cache write policy: 0 for write-back, 1 for write-through,
> +               2 for other or unknown.
> --

It would be good to document the meaning of indexY itself too.
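
E.g. (sketch):

What:          /sys/devices/system/node/nodeX/side_cache/indexY/
Date:          December 2018
Contact:       Keith Busch <keith.busch@intel.com>
Description:
               The directory containing the attributes of the level "Y"
               memory side cache for node X's memory.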

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 06/13] acpi/hmat: Register processor domain to its memory
  2019-01-17 12:11   ` Rafael J. Wysocki
@ 2019-01-17 17:01     ` Dan Williams
  0 siblings, 0 replies; 51+ messages in thread
From: Dan Williams @ 2019-01-17 17:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Keith Busch, Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Dave Hansen

On Thu, Jan 17, 2019 at 4:11 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
>     On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > If the HMAT Subsystem Address Range provides a valid processor proximity
> > domain for a memory domain, or a processor domain with the highest
> > performing access exists, register the memory target with that initiator
> > so this relationship will be visible under the node's sysfs directory.
> >
> > Since HMAT requires valid address ranges have an equivalent SRAT entry,
> > verify each memory target satisfies this requirement.
>
> What exactly will happen after this patch?
>
> There will be some new directories under
> /sys/devices/system/node/nodeX/ if all goes well.  Anything else?

When / if the memory randomization series [1] makes its way upstream
there will be a follow-on patch to enable that randomization based on
the presence of a memory-side cache published in the HMAT.

[1]: https://lwn.net/Articles/767614/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 12/13] acpi/hmat: Register memory side cache attributes
  2019-01-16 17:58 ` [PATCHv4 12/13] acpi/hmat: Register memory side " Keith Busch
@ 2019-01-17 17:42   ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-17 17:42 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Rafael Wysocki,
	Dave Hansen, Dan Williams

On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Register memory side cache attributes with the memory's node if HMAT
> provides the side cache information table.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/acpi/hmat/hmat.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
>
> diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
> index 45e20dc677f9..9efdd0a63a79 100644
> --- a/drivers/acpi/hmat/hmat.c
> +++ b/drivers/acpi/hmat/hmat.c
> @@ -206,6 +206,7 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
>                                    const unsigned long end)
>  {
>         struct acpi_hmat_cache *cache = (void *)header;
> +       struct node_cache_attrs cache_attrs;
>         u32 attrs;
>
>         if (cache->header.length < sizeof(*cache)) {
> @@ -219,6 +220,37 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
>                 cache->memory_PD, cache->cache_size, attrs,
>                 cache->number_of_SMBIOShandles);
>
> +       cache_attrs.size = cache->cache_size;
> +       cache_attrs.level = (attrs & ACPI_HMAT_CACHE_LEVEL) >> 4;
> +       cache_attrs.line_size = (attrs & ACPI_HMAT_CACHE_LINE_SIZE) >> 16;
> +
> +       switch ((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) {
> +       case ACPI_HMAT_CA_DIRECT_MAPPED:
> +               cache_attrs.associativity = NODE_CACHE_DIRECT_MAP;
> +               break;
> +       case ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING:
> +               cache_attrs.associativity = NODE_CACHE_INDEXED;
> +               break;
> +       case ACPI_HMAT_CA_NONE:
> +       default:

This looks slightly odd as "default" covers the other case as well.
Maybe say in a comment which other cases are covered by "default" in
particular?
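
E.g. (sketch):

	case ACPI_HMAT_CA_NONE:
	default:	/* also covers reserved encodings */
		cache_attrs.associativity = NODE_CACHE_OTHER;
		break;

(and the same kind of comment for the write policy switch below).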

> +               cache_attrs.associativity = NODE_CACHE_OTHER;
> +               break;
> +       }
> +
> +       switch ((attrs & ACPI_HMAT_WRITE_POLICY) >> 12) {
> +       case ACPI_HMAT_CP_WB:
> +               cache_attrs.write_policy = NODE_CACHE_WRITE_BACK;
> +               break;
> +       case ACPI_HMAT_CP_WT:
> +               cache_attrs.write_policy = NODE_CACHE_WRITE_THROUGH;
> +               break;
> +       case ACPI_HMAT_CP_NONE:
> +       default:

And analogously here.

> +               cache_attrs.write_policy = NODE_CACHE_WRITE_OTHER;
> +               break;
> +       }
> +
> +       node_add_cache(pxm_to_node(cache->memory_PD), &cache_attrs);
>         return 0;
>  }
>
> --

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 00/13] Heterogeneuos memory node attributes
  2019-01-16 17:57 [PATCHv4 00/13] Heterogeneuos memory node attributes Keith Busch
                   ` (13 preceding siblings ...)
  2019-01-17 12:58 ` [PATCHv4 00/13] Heterogeneuos memory node attributes Balbir Singh
@ 2019-01-17 18:18 ` Jonathan Cameron
  2019-01-17 19:47   ` Keith Busch
  14 siblings, 1 reply; 51+ messages in thread
From: Jonathan Cameron @ 2019-01-17 18:18 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams, linuxarm

On Wed, 16 Jan 2019 10:57:51 -0700
Keith Busch <keith.busch@intel.com> wrote:

> The series seems quite calm now. I've received some approvals of the
> proposal, and heard no objections on the new core interfaces.
> 
> Please let me know if there is anyone or group of people I should request
> and wait for a review. And if anyone reading this would like additional
> time as well before I post a potentially subsequent version, please let
> me know.
> 
> I also wanted to inquire on upstream strategy if/when all desired
> reviews are received. The series is spanning a few subsystems, so I'm
> not sure who's tree is the best candidate. I could see an argument for
> driver-core, acpi, or mm as possible paths. Please let me know if there's
> a more appropriate option or any other gating concerns.
> 
> == Changes from v3 ==
> 
>   I've fixed the documentation issues that have been raised for v3 
> 
>   Moved the hmat files according to Rafael's recommendation
> 
>   Added received Reviewed-by's
> 
> Otherwise this v4 is much the same as v3.
> 
> == Background ==
> 
> Platforms may provide multiple types of cpu attached system memory. The
> memory ranges for each type may have different characteristics that
> applications may wish to know about when considering what node they want
> their memory allocated from. 
> 
> It had previously been difficult to describe these setups as memory
> ranges were generally lumped into the NUMA node of the CPUs. New
> platform attributes have been created and in use today that describe
> the more complex memory hierarchies that can be created.
> 
> This series' objective is to provide the attributes from such systems
> that are useful for applications to know about, and readily usable with
> existing tools and libraries.
> 

Hi Keith, 

I've been having a play with various hand constructed HMAT tables to allow
me to try breaking them in all sorts of ways.

Mostly working as expected.

Two places I am so far unsure on...

1. Concept of 'best' is not implemented in a consistent fashion.

I don't agree with the logic to match on 'best' because it can give some counter
intuitive sets of target nodes.

For my simple test case we have both the latency and bandwidth specified (using
access as I'm lazy and it saves typing).

Rather that matching when both are the best value, we match when _any_ of the
measurements is the 'best' for the type of measurement.

A simple system with a high bandwidth interconnect between two SoCs
might well have identical bandwidths to memory connected to each node, but
much worse latency to the remote one.  Another simple case would be DDR and
SCM on roughly the same memory controller.  Bandwidths likely to be equal,
latencies very different.

Right now we get both nodes in the list of 'best' ones because the bandwidths
are equal which is far from ideal.  It also means we are presenting one value
for both latency and bandwidth, misrepresenting the ones where it doesn't apply.

If we aren't going to specify that both must be "best", then I think we should
separate the bandwidth and latency classes, requiring userspace to check
both if they want the best combination of latency and bandwidth. I'm also
happy enough (having not thought about it much) to have one class where the 'best'
is the value sorted first on best latency and then on best bandwidth.

2. Handling of memory only nodes - that might have a device attached - _PXM

This is a common situation in CCIX for example where you have an accelerator
with coherent memory homed at it. Looks like a pci device in a domain with
the memory.   Right now you can't actually do this as _PXM is processed
for pci devices, but we'll get that fixed (broken threadripper firmwares
meant it got reverted last cycle).

In my case I have 4 nodes with cpu and memory (0,1,2,3) and 2 memory only (4,5)
Memory only are longer latency and lower bandwidth.

Now
ls /sys/bus/nodes/devices/node0/class0/
...

initiator0
target0
target4
target5

read_bandwidth = 15000
read_latency = 10000

These two values (and their paired write values) are correct for initiator0 to target0
but completely wrong for initiator0 to target4 or target5.

This occurs because we loop over the targets looking for the best values and
set the relevant bit in t->p_nodes based on that.  These memory only nodes have
a best value that happens to be equal from all the initiators.  The issue is it
isn't the one reported in the node0/class0.

Also if we look in
/sys/bus/nodes/devices/node4/class0 there are no targets listed (there are the expected
4 initiators 0-3).

I'm not sure what the intended behavior would be in this case.

I'll run some more tests tomorrow.

Jonathan

> Keith Busch (13):
>   acpi: Create subtable parsing infrastructure
>   acpi: Add HMAT to generic parsing tables
>   acpi/hmat: Parse and report heterogeneous memory
>   node: Link memory nodes to their compute nodes
>   Documentation/ABI: Add new node sysfs attributes
>   acpi/hmat: Register processor domain to its memory
>   node: Add heterogenous memory access attributes
>   Documentation/ABI: Add node performance attributes
>   acpi/hmat: Register performance attributes
>   node: Add memory caching attributes
>   Documentation/ABI: Add node cache attributes
>   acpi/hmat: Register memory side cache attributes
>   doc/mm: New documentation for memory performance
> 
>  Documentation/ABI/stable/sysfs-devices-node   |  87 +++++-
>  Documentation/admin-guide/mm/numaperf.rst     | 184 +++++++++++++
>  arch/arm64/kernel/acpi_numa.c                 |   2 +-
>  arch/arm64/kernel/smp.c                       |   4 +-
>  arch/ia64/kernel/acpi.c                       |  12 +-
>  arch/x86/kernel/acpi/boot.c                   |  36 +--
>  drivers/acpi/Kconfig                          |   1 +
>  drivers/acpi/Makefile                         |   1 +
>  drivers/acpi/hmat/Kconfig                     |   9 +
>  drivers/acpi/hmat/Makefile                    |   1 +
>  drivers/acpi/hmat/hmat.c                      | 375 ++++++++++++++++++++++++++
>  drivers/acpi/numa.c                           |  16 +-
>  drivers/acpi/scan.c                           |   4 +-
>  drivers/acpi/tables.c                         |  76 +++++-
>  drivers/base/Kconfig                          |   8 +
>  drivers/base/node.c                           | 317 +++++++++++++++++++++-
>  drivers/irqchip/irq-gic-v2m.c                 |   2 +-
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +-
>  drivers/irqchip/irq-gic-v3-its.c              |   6 +-
>  drivers/irqchip/irq-gic-v3.c                  |  10 +-
>  drivers/irqchip/irq-gic.c                     |   4 +-
>  drivers/mailbox/pcc.c                         |   2 +-
>  include/linux/acpi.h                          |   6 +-
>  include/linux/node.h                          |  70 ++++-
>  25 files changed, 1172 insertions(+), 65 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>  create mode 100644 drivers/acpi/hmat/Kconfig
>  create mode 100644 drivers/acpi/hmat/Makefile
>  create mode 100644 drivers/acpi/hmat/hmat.c
> 



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 00/13] Heterogeneuos memory node attributes
  2019-01-17 18:18 ` Jonathan Cameron
@ 2019-01-17 19:47   ` Keith Busch
  2019-01-18 11:12     ` Jonathan Cameron
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-17 19:47 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Hansen, Dave, Williams, Dan J, linuxarm

On Thu, Jan 17, 2019 at 10:18:35AM -0800, Jonathan Cameron wrote:
> I've been having a play with various hand constructed HMAT tables to allow
> me to try breaking them in all sorts of ways.
> 
> Mostly working as expected.
> 
> Two places I am so far unsure on...
> 
> 1. Concept of 'best' is not implemented in a consistent fashion.
> 
> I don't agree with the logic to match on 'best' because it can give some counter
> intuitive sets of target nodes.
> 
> For my simple test case we have both the latency and bandwidth specified (using
> access as I'm lazy and it saves typing).
> 
> Rather that matching when both are the best value, we match when _any_ of the
> measurements is the 'best' for the type of measurement.
> 
> A simple system with a high bandwidth interconnect between two SoCs
> might well have identical bandwidths to memory connected to each node, but
> much worse latency to the remote one.  Another simple case would be DDR and
> SCM on roughly the same memory controller.  Bandwidths likely to be equal,
> latencies very different.
> 
> Right now we get both nodes in the list of 'best' ones because the bandwidths
> are equal which is far from ideal.  It also means we are presenting one value
> for both latency and bandwidth, misrepresenting the ones where it doesn't apply.
> 
> If we aren't going to specify that both must be "best", then I think we should
> separate the bandwidth and latency classes, requiring userspace to check
> both if they want the best combination of latency and bandwidth. I'm also
> happy enough (having not thought about it much) to have one class where the 'best'
> is the value sorted first on best latency and then on best bandwidth.

Okay, I see what you mean. I must admit my test environment doesn't have
nodes with the same bandwidth but different latency, so we may get the
wrong information with the HMAT parsing in this series. I'll look into
fixing that and consider your suggestions.
 
> 2. Handling of memory only nodes - that might have a device attached - _PXM
> 
> This is a common situation in CCIX for example where you have an accelerator
> with coherent memory homed at it. Looks like a pci device in a domain with
> > the memory.  Right now you can't actually do this as _PXM isn't processed
> for pci devices, but we'll get that fixed (broken threadripper firmwares
> meant it got reverted last cycle).
> 
> In my case I have 4 nodes with cpu and memory (0,1,2,3) and 2 memory only (4,5)
> Memory only are longer latency and lower bandwidth.
> 
> Now
> ls /sys/bus/nodes/devices/node0/class0/
> ...
> 
> initiator0
> target0
> target4
> target5
> 
> read_bandwidth = 15000
> read_latency = 10000
> 
> These two values (and their paired write values) are correct for initiator0 to target0
> but completely wrong for initiator0 to target4 or target5.

Hm, this wasn't intended to tell us performance for the initiator's
targets. The performance data here is when you access node0's memory
target from a node in its initiator_list, or one of the symlinked
initiatorX's.

If you want to see the performance attributes for accessing
initiator0->target4, you can check:

  /sys/devices/system/node/node0/class0/target4/class0/read_bandwidth

> This occurs because we loop over the targets looking for the best values and
> set the relevant bit in t->p_nodes based on that.  These memory only nodes have
> a best value that happens to be equal from all the initiators.  The issue is it
> isn't the one reported in the node0/class0.
>
> Also if we look in
> /sys/bus/nodes/devices/node4/class0 there are no targets listed (there are the expected
> 4 initiators 0-3).
> 
> I'm not sure what the intended behavior would be in this case.

You mentioned that node 4 is a memory-only node, so it can't have any
targets, right?


* Re: [PATCHv4 00/13] Heterogeneous memory node attributes
  2019-01-17 19:47   ` Keith Busch
@ 2019-01-18 11:12     ` Jonathan Cameron
  0 siblings, 0 replies; 51+ messages in thread
From: Jonathan Cameron @ 2019-01-18 11:12 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Hansen, Dave, Williams, Dan J, linuxarm

On Thu, 17 Jan 2019 12:47:51 -0700
Keith Busch <keith.busch@intel.com> wrote:

> On Thu, Jan 17, 2019 at 10:18:35AM -0800, Jonathan Cameron wrote:
> > I've been having a play with various hand constructed HMAT tables to allow
> > me to try breaking them in all sorts of ways.
> > 
> > Mostly working as expected.
> > 
> > Two places I am so far unsure on...
> > 
> > 1. Concept of 'best' is not implemented in a consistent fashion.
> > 
> > I don't agree with the logic to match on 'best' because it can give some
> > counter-intuitive sets of target nodes.
> > 
> > For my simple test case we have both the latency and bandwidth specified (using
> > access as I'm lazy and it saves typing).
> > 
> > Rather than matching when both are the best value, we match when _any_ of the
> > measurements is the 'best' for the type of measurement.
> > 
> > A simple system with a high bandwidth interconnect between two SoCs
> > might well have identical bandwidths to memory connected to each node, but
> > much worse latency to the remote one.  Another simple case would be DDR and
> > SCM on roughly the same memory controller.  Bandwidths likely to be equal,
> > latencies very different.
> > 
> > Right now we get both nodes in the list of 'best' ones because the bandwidths
> > are equal which is far from ideal.  It also means we are presenting one value
> > for both latency and bandwidth, misrepresenting the ones where it doesn't apply.
> > 
> > If we aren't going to specify that both must be "best", then I think we should
> > separate the bandwidth and latency classes, requiring userspace to check
> > both if they want the best combination of latency and bandwidth. I'm also
> > happy enough (having not thought about it much) to have one class where the 'best'
> > is the value sorted first on best latency and then on best bandwidth.  
> 
> Okay, I see what you mean. I must admit my test environment doesn't have
> nodes with the same bandwidth but different latency, so we may get the
> wrong information with the HMAT parsing in this series. I'll look into
> fixing that and consider your suggestions.

Great.

>  
> > 2. Handling of memory only nodes - that might have a device attached - _PXM
> > 
> > This is a common situation in CCIX for example where you have an accelerator
> > with coherent memory homed at it. Looks like a pci device in a domain with
> > the memory.   Right now you can't actually do this as _PXM is processed
> > for pci devices, but we'll get that fixed (broken threadripper firmwares
> > meant it got reverted last cycle).
> > 
> > In my case I have 4 nodes with cpu and memory (0,1,2,3) and 2 memory only (4,5)
> > Memory only are longer latency and lower bandwidth.
> > 
> > Now
> > ls /sys/bus/nodes/devices/node0/class0/
> > ...
> > 
> > initiator0
> > target0
> > target4
> > target5
> > 
> > read_bandwidth = 15000
> > read_latency = 10000
> > 
> > These two values (and their paired write values) are correct for initiator0 to target0
> > but completely wrong for initiator0 to target4 or target5.  
> 
> Hm, this wasn't intended to tell us performance for the initiator's
> targets. The performance data here is when you access node0's memory
> target from a node in its initiator_list, or one of the symlinked
> initiatorX's.

> 
> If you want to see the performance attributes for accessing
> initiator0->target4, you can check:
> 
>   /sys/devices/system/node/node0/class0/target4/class0/read_bandwidth

Ah.  That makes sense, but it does raise the question of whether this interface
is rather unintuitive and that the example given in the docs for the PCI device
doesn't always work.  Perhaps it is that documentation that needs refining.

Having values that don't apply to particular combinations of entries
in the initiator_list and target_list based on which directory we are
in doesn't seem great to me.

So the presence of a target directory in a node indicates the memory
in the target is 'best' accessed from this node, but is unrelated to the
values provided in this node.

One thought is we are trying to combine two unrelated questions and that
is what is leading to the confusion.

1) I have a process (or similar) in this node, which is the 'best' memory
   to use and what are its characteristics.

2) I have data in this memory, which processor node should I schedule my
   processing on.

Ideally we want to avoid searching all the nodes.
To test this I disabled the memory on one of my nodes to make it even more
pathological (this is valid on the real hardware as the cpus don't have to
have memory attached to them).  So same as before but now node 3 is initiator only.

Using bandwidth as a proxy for all the other measurements...

For question 1 (what memory)

Initiator in node 0

Need to check
node0/class0/target0/class0/read_bandwidth (can shortcut this one obviously)
node0/class0/target4/class0/read_bandwidth
node0/class0/target5/class0/read_bandwidth

and discover that the best value is the one for node0/class0/target0
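
A rough userspace sketch of that search (node and target numbers are
hard-coded from this example, and error handling is minimal):

  /* scan node0's class0 targets and report the best read_bandwidth */
  #include <stdio.h>

  static long read_bw(int target)
  {
          char path[128];
          long bw = -1;
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/devices/system/node/node0/class0/target%d/class0/read_bandwidth",
                   target);
          f = fopen(path, "r");
          if (!f)
                  return -1;
          if (fscanf(f, "%ld", &bw) != 1)
                  bw = -1;
          fclose(f);
          return bw;
  }

  int main(void)
  {
          int targets[] = { 0, 4, 5 };
          int i, best = -1;
          long best_bw = -1;

          for (i = 0; i < 3; i++) {
                  long bw = read_bw(targets[i]);

                  if (bw > best_bw) {
                          best_bw = bw;
                          best = targets[i];
                  }
          }
          printf("best read_bandwidth: node0 -> target%d (%ld)\n",
                 best, best_bw);
          return 0;
  }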

Initiator in node 3 (initiator only)
node3/class0/initiator_nodelist is empty, so no useful information is available.

An initiator in node 4 (a PCI card, for example) can only assume node 4, as
there is no other information and it has memory (which, incidentally, might
have longer latency than memory over the interconnect).

For question 2 (what processor)

We are on better grounds

node0/class0/initiator0
node1/class0/initiator1
node2/class0/initiator2
node3 doesn't make sense (no memory)
node4/class0/initiator[0-3]
node5/class0/initiator[0-3]
All the memory / bandwidth numbers are as expected.


So my conclusion is this works fine for suggesting the processor to use for
given memory (or accelerator or whatever, though only if they are closely
coupled with the processors). It doesn't work for choosing the memory to use
for a given processor / PCI card etc.  Sometimes there is no answer,
sometimes you have to search to find it.

Does it make more sense to just have two classes?

1) Which memory is nearest to me?
2) Which processor is nearest to me? 
(3) "Which processor of type X is nearest to me?" is harder to answer, but useful.

Note that case 2 is clearly covered well by the existing scheme, but I can't
actually see what benefit having the target links has for that use case.

To illustrate how I think that would work.

Class 0, existing but with target links dropped.

What processor for each memory
node0/class0/initiator_nodelist 0
node1/class0/initiator_nodelist 1
node2/class0/initiator_nodelist 2
node3/class0/initiator_nodelist ""
node4/class0/initiator_nodelist 0-3
node5/class0/initiator_nodelist 0-3
All the memory stats reflect the values for the given initiator accessing this memory.

Class 1, a new one for "what memory is nearest to me" - no initiator files as that's 'me'.
node0/class1/target_nodelist 0
node1/class1/target_nodelist 1
node2/class1/target_nodelist 2
node3/class1/target_nodelist 0-2
node4/class1/target_nodelist "" as not specified as an initiator in hmat. 
				Ideally class1 wouldn't even be there.
node5/class1/target_nodelist "" same

For now we would have no information for accelerators in nodes 4 and 5, but
ACPI doesn't really provide us the info for them anyway.  I suppose you could
have HMAT describing 'initiators' in those nodes without there being any
processors in SRAT.  Let's park that one to solve another day.

Your pci example then becomes

+  # NODE=$(cat /sys/devices/pci:0000:00/.../numa_node)
+  # numactl --membind=$(cat /sys/devices/node/node${NODE}/class1/target_nodelist) \
+      --cpunodebind=$(cat /sys/devices/node/node${NODE}/class0/initiator_nodelist) \
+      -- <some-program-to-execute>

What do you think?

Jonathan

> 
> > This occurs because we loop over the targets looking for the best values and
> > set the relevant bit in t->p_nodes based on that.  These memory only nodes have
> > a best value that happens to be equal from all the initiators.  The issue is it
> > isn't the one reported in the node0/class0.
> >
> > Also if we look in
> > /sys/bus/nodes/devices/node4/class0 there are no targets listed (there are the expected
> > 4 initiators 0-3).
> > 
> > I'm not sure what the intended behavior would be in this case.  
> 
> You mentioned that node 4 is a memory-only node, so it can't have any
> targets, right?
Depends on what is there that ACPI 6.2 doesn't describe :)  It can have
PCI cards, for example.

Jonathan




* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-16 17:57 ` [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes Keith Busch
  2019-01-17 11:41   ` Rafael J. Wysocki
@ 2019-01-18 11:21   ` Jonathan Cameron
  2019-01-18 16:35     ` Dan Williams
  1 sibling, 1 reply; 51+ messages in thread
From: Jonathan Cameron @ 2019-01-18 11:21 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Wed, 16 Jan 2019 10:57:56 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Add entries for memory initiator and target node class attributes.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 3e90e1f3bf0a..a9c47b4b0eee 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -90,4 +90,27 @@ Date:		December 2009
>  Contact:	Lee Schermerhorn <lee.schermerhorn@hp.com>
>  Description:
>  		The node's huge page size control/query attributes.
> -		See Documentation/admin-guide/mm/hugetlbpage.rst
> \ No newline at end of file
> +		See Documentation/admin-guide/mm/hugetlbpage.rst
> +
> +What:		/sys/devices/system/node/nodeX/classY/
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The node's relationship to other nodes for access class "Y".
> +
> +What:		/sys/devices/system/node/nodeX/classY/initiator_nodelist
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The node list of memory initiators that have class "Y" access
> +		to this node's memory. CPUs and other memory initiators in
> +		nodes not in the list accessing this node's memory may have
> +		different performance.
> +
> +What:		/sys/devices/system/node/nodeX/classY/target_nodelist
> +Date:		December 2018
> +Contact:	Keith Busch <keith.busch@intel.com>
> +Description:
> +		The node list of memory targets that this initiator node has
> +		class "Y" access. Memory accesses from this node to nodes not
> +		in this list may have different performance.

Different performance from what?  In the other thread we established that
these target_nodelists are kind of a backwards reference; they all have
their characteristics anyway.  Perhaps this just needs to say:
"Memory access from this node to these targets may have different performance"?

i.e. Don't make the assumption I did that they should all be the same!





* Re: [PATCHv4 00/13] Heterogeneous memory node attributes
  2019-01-17 15:44   ` Keith Busch
@ 2019-01-18 13:16     ` Balbir Singh
  0 siblings, 0 replies; 51+ messages in thread
From: Balbir Singh @ 2019-01-18 13:16 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, Jan 17, 2019 at 08:44:37AM -0700, Keith Busch wrote:
> On Thu, Jan 17, 2019 at 11:58:21PM +1100, Balbir Singh wrote:
> > On Wed, Jan 16, 2019 at 10:57:51AM -0700, Keith Busch wrote:
> > > It had previously been difficult to describe these setups as memory
> > > ranges were generally lumped into the NUMA node of the CPUs. New
> > > platform attributes have been created and in use today that describe
> > > the more complex memory hierarchies that can be created.
> > > 
> > 
> > Could you please expand on this text -- how are these attributes
> > exposed/consumed by both the kernel and user space?
> > 
> > > This series' objective is to provide the attributes from such systems
> > > that are useful for applications to know about, and readily usable with
> > > existing tools and libraries.
> > 
> > I presume these tools and libraries are numactl and mbind()?
> 
Yes, and numactl is used in the examples provided in both changelogs and
> documentation in this series. Do you want to see those in the cover
> letter as well?

Not really, I was just reading through the cover letter and was curious.

Balbir Singh.


* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-18 11:21   ` Jonathan Cameron
@ 2019-01-18 16:35     ` Dan Williams
  0 siblings, 0 replies; 51+ messages in thread
From: Dan Williams @ 2019-01-18 16:35 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Keith Busch, Linux Kernel Mailing List, Linux ACPI, Linux MM,
	Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen

On Fri, Jan 18, 2019 at 3:22 AM Jonathan Cameron
<jonathan.cameron@huawei.com> wrote:
>
> On Wed, 16 Jan 2019 10:57:56 -0700
> Keith Busch <keith.busch@intel.com> wrote:
>
> > Add entries for memory initiator and target node class attributes.
> >
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > index 3e90e1f3bf0a..a9c47b4b0eee 100644
> > --- a/Documentation/ABI/stable/sysfs-devices-node
> > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > @@ -90,4 +90,27 @@ Date:              December 2009
> >  Contact:     Lee Schermerhorn <lee.schermerhorn@hp.com>
> >  Description:
> >               The node's huge page size control/query attributes.
> > -             See Documentation/admin-guide/mm/hugetlbpage.rst
> > \ No newline at end of file
> > +             See Documentation/admin-guide/mm/hugetlbpage.rst
> > +
> > +What:                /sys/devices/system/node/nodeX/classY/
> > +Date:                December 2018
> > +Contact:     Keith Busch <keith.busch@intel.com>
> > +Description:
> > +             The node's relationship to other nodes for access class "Y".
> > +
> > +What:                /sys/devices/system/node/nodeX/classY/initiator_nodelist
> > +Date:                December 2018
> > +Contact:     Keith Busch <keith.busch@intel.com>
> > +Description:
> > +             The node list of memory initiators that have class "Y" access
> > +             to this node's memory. CPUs and other memory initiators in
> > +             nodes not in the list accessing this node's memory may have
> > +             different performance.
> > +
> > +What:                /sys/devices/system/node/nodeX/classY/target_nodelist
> > +Date:                December 2018
> > +Contact:     Keith Busch <keith.busch@intel.com>
> > +Description:
> > +             The node list of memory targets that this initiator node has
> > +             class "Y" access. Memory accesses from this node to nodes not
> > +             in this list may have different performance.
>
> Different performance from what?  In the other thread we established that
> these target_nodelists are kind of a backwards reference, they all have
> their characteristics anyway.  Perhaps this just needs to say:
> "Memory access from this node to these targets may have different performance"?
>
> i.e. Don't make the assumption I did that they should all be the same!

I think a clarification of "class" is needed in this context. A
"class" is the the set of initiators that have the same rated
performance to a given target set. In other words "class" is a tuple
of (performance profile, initiator set, target set). Different
performance creates a different tuple / class.
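
Loosely, in C terms, I mean something like this (an illustrative
struct of my own, not one from the series):

  /* one access class: a shared performance profile plus the node sets
   * it applies to */
  struct access_class {
          unsigned int    read_bandwidth;
          unsigned int    write_bandwidth;
          unsigned int    read_latency;
          unsigned int    write_latency;
          nodemask_t      initiators;     /* initiator set */
          nodemask_t      targets;        /* target set */
  };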


* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-17 11:41   ` Rafael J. Wysocki
@ 2019-01-18 20:42     ` Keith Busch
  2019-01-18 21:08     ` Dan Williams
  1 sibling, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-01-18 20:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, ACPI Devel Mailing List,
	Linux Memory Management List, Greg Kroah-Hartman, Dave Hansen,
	Dan Williams

On Thu, Jan 17, 2019 at 12:41:19PM +0100, Rafael J. Wysocki wrote:
> On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > Add entries for memory initiator and target node class attributes.
> >
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> 
> I would recommend combining this with the previous patch, as the way
> it is now I need to look at two patches at the time. :-)

Will do.
 
> > ---
> >  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > index 3e90e1f3bf0a..a9c47b4b0eee 100644
> > --- a/Documentation/ABI/stable/sysfs-devices-node
> > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > @@ -90,4 +90,27 @@ Date:                December 2009
> >  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
> >  Description:
> >                 The node's huge page size control/query attributes.
> > -               See Documentation/admin-guide/mm/hugetlbpage.rst
> > \ No newline at end of file
> > +               See Documentation/admin-guide/mm/hugetlbpage.rst
> > +
> > +What:          /sys/devices/system/node/nodeX/classY/
> > +Date:          December 2018
> > +Contact:       Keith Busch <keith.busch@intel.com>
> > +Description:
> > +               The node's relationship to other nodes for access class "Y".
> > +
> > +What:          /sys/devices/system/node/nodeX/classY/initiator_nodelist
> > +Date:          December 2018
> > +Contact:       Keith Busch <keith.busch@intel.com>
> > +Description:
> > +               The node list of memory initiators that have class "Y" access
> > +               to this node's memory. CPUs and other memory initiators in
> > +               nodes not in the list accessing this node's memory may have
> > +               different performance.
> 
> This does not follow the general "one value per file" rule of sysfs (I
> know that there are other sysfs files with more than one value in
> them, but it is better to follow this rule as long as that makes
> sense).

I think it actually does make sense to use "%*pbl" format here. The
type we're displaying is a bit array, so this is one value for that
attribute. There are many precedents for displaying bit arrays this
way. The existing node attributes that we're appending to show cpu
lists, and the top level shows node lists for "possible", "online",
"has_memory", "has_cpu". This new attribute should fit well with
existing similar attributes from this subsystem.
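
For illustration, a sketch of what such a show function can look like
with that format (the struct and helper names here are placeholders,
not code from this series):

  /* needs <linux/device.h> and <linux/nodemask.h> */
  static ssize_t initiator_nodelist_show(struct device *dev,
                                         struct device_attribute *attr,
                                         char *buf)
  {
          /* hypothetical container holding this class's initiator mask */
          struct node_class *c = to_node_class(dev);

          /* "%*pbl" renders a bitmap as a ranged list, e.g. "0-3" */
          return scnprintf(buf, PAGE_SIZE, "%*pbl\n",
                           nodemask_pr_args(&c->initiator_nodes));
  }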
 
> > +
> > +What:          /sys/devices/system/node/nodeX/classY/target_nodelist
> > +Date:          December 2018
> > +Contact:       Keith Busch <keith.busch@intel.com>
> > +Description:
> > +               The node list of memory targets that this initiator node has
> > +               class "Y" access. Memory accesses from this node to nodes not
> > +               in this list may have different performance.
> > --
> 
> Same here.
> 
> And if you follow the recommendation given in the previous message
> (add "initiators" and "targets" subdirs under "classX"), you won't
> even need the two files above.

I think that makes sense, I'll take try out this suggestion.
 
> And, of course, the symlinks part needs to be documented as well.  I
> guess you can follow the
> Documentation/ABI/testing/sysfs-devices-power_resources_D0 with that.


* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-17 11:41   ` Rafael J. Wysocki
  2019-01-18 20:42     ` Keith Busch
@ 2019-01-18 21:08     ` Dan Williams
  2019-01-19  9:01       ` Greg Kroah-Hartman
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Williams @ 2019-01-18 21:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Keith Busch, Linux Kernel Mailing List, ACPI Devel Mailing List,
	Linux Memory Management List, Greg Kroah-Hartman, Dave Hansen

On Thu, Jan 17, 2019 at 3:41 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > Add entries for memory initiator and target node class attributes.
> >
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
>
> I would recommend combining this with the previous patch, as the way
> it is now I need to look at two patches at the time. :-)
>
> > ---
> >  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > index 3e90e1f3bf0a..a9c47b4b0eee 100644
> > --- a/Documentation/ABI/stable/sysfs-devices-node
> > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > @@ -90,4 +90,27 @@ Date:                December 2009
> >  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
> >  Description:
> >                 The node's huge page size control/query attributes.
> > -               See Documentation/admin-guide/mm/hugetlbpage.rst
> > \ No newline at end of file
> > +               See Documentation/admin-guide/mm/hugetlbpage.rst
> > +
> > +What:          /sys/devices/system/node/nodeX/classY/
> > +Date:          December 2018
> > +Contact:       Keith Busch <keith.busch@intel.com>
> > +Description:
> > +               The node's relationship to other nodes for access class "Y".
> > +
> > +What:          /sys/devices/system/node/nodeX/classY/initiator_nodelist
> > +Date:          December 2018
> > +Contact:       Keith Busch <keith.busch@intel.com>
> > +Description:
> > +               The node list of memory initiators that have class "Y" access
> > +               to this node's memory. CPUs and other memory initiators in
> > +               nodes not in the list accessing this node's memory may have
> > +               different performance.
>
> This does not follow the general "one value per file" rule of sysfs (I
> know that there are other sysfs files with more than one value in
> them, but it is better to follow this rule as long as that makes
> sense).
>
> > +
> > +What:          /sys/devices/system/node/nodeX/classY/target_nodelist
> > +Date:          December 2018
> > +Contact:       Keith Busch <keith.busch@intel.com>
> > +Description:
> > +               The node list of memory targets that this initiator node has
> > +               class "Y" access. Memory accesses from this node to nodes not
> > +               in this list may have differet performance.
> > --
>
> Same here.
>
> And if you follow the recommendation given in the previous message
> (add "initiators" and "targets" subdirs under "classX"), you won't
> even need the two files above.

This recommendation is in conflict with Greg's feedback about kobject
usage. If these are just "vanity" subdirs I think it's better to have
a multi-value sysfs file. This "list" style is already commonplace for
the /sys/devices/system hierarchy.


* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-18 21:08     ` Dan Williams
@ 2019-01-19  9:01       ` Greg Kroah-Hartman
  2019-01-19 16:56         ` Dan Williams
  2019-01-20 16:16         ` Rafael J. Wysocki
  0 siblings, 2 replies; 51+ messages in thread
From: Greg Kroah-Hartman @ 2019-01-19  9:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Rafael J. Wysocki, Keith Busch, Linux Kernel Mailing List,
	ACPI Devel Mailing List, Linux Memory Management List,
	Dave Hansen

On Fri, Jan 18, 2019 at 01:08:02PM -0800, Dan Williams wrote:
> On Thu, Jan 17, 2019 at 3:41 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> > >
> > > Add entries for memory initiator and target node class attributes.
> > >
> > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> >
> > I would recommend combining this with the previous patch, as the way
> > it is now I need to look at two patches at the time. :-)
> >
> > > ---
> > >  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
> > >  1 file changed, 24 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > > index 3e90e1f3bf0a..a9c47b4b0eee 100644
> > > --- a/Documentation/ABI/stable/sysfs-devices-node
> > > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > > @@ -90,4 +90,27 @@ Date:                December 2009
> > >  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
> > >  Description:
> > >                 The node's huge page size control/query attributes.
> > > -               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > \ No newline at end of file
> > > +               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > +
> > > +What:          /sys/devices/system/node/nodeX/classY/
> > > +Date:          December 2018
> > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > +Description:
> > > +               The node's relationship to other nodes for access class "Y".
> > > +
> > > +What:          /sys/devices/system/node/nodeX/classY/initiator_nodelist
> > > +Date:          December 2018
> > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > +Description:
> > > +               The node list of memory initiators that have class "Y" access
> > > +               to this node's memory. CPUs and other memory initiators in
> > > +               nodes not in the list accessing this node's memory may have
> > > +               different performance.
> >
> > This does not follow the general "one value per file" rule of sysfs (I
> > know that there are other sysfs files with more than one value in
> > them, but it is better to follow this rule as long as that makes
> > sense).
> >
> > > +
> > > +What:          /sys/devices/system/node/nodeX/classY/target_nodelist
> > > +Date:          December 2018
> > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > +Description:
> > > +               The node list of memory targets that this initiator node has
> > > +               class "Y" access. Memory accesses from this node to nodes not
> > > +               in this list may have different performance.
> > > --
> >
> > Same here.
> >
> > And if you follow the recommendation given in the previous message
> > (add "initiators" and "targets" subdirs under "classX"), you won't
> > even need the two files above.
> 
> This recommendation is in conflict with Greg's feedback about kobject
> usage. If these are just "vanity" subdirs I think it's better to have
> a multi-value sysfs file. This "list" style is already commonplace for
> the /sys/devices/system hierarchy.

If you do a subdirectory "correctly" (i.e. a name for an attribute
group), that's fine.  Just do not ever create a kobject just for a
subdir, that will mess up userspace.
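
For example, a named attribute group gives you the subdirectory with
no extra kobject (a sketch; the attribute and variable names are only
placeholders, not code from the series):

  /* needs <linux/device.h> and <linux/sysfs.h> */
  static struct attribute *class0_attrs[] = {
          &dev_attr_read_bandwidth.attr,  /* placeholder attributes */
          &dev_attr_read_latency.attr,
          NULL,
  };

  static const struct attribute_group class0_group = {
          .name  = "class0",      /* the name creates nodeX/class0/ */
          .attrs = class0_attrs,
  };

  /* then, in the node's registration path: */
  ret = sysfs_create_group(&node_dev->kobj, &class0_group);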

And I hate the "multi-value" sysfs files, where at all possible, please
do not copy past bad mistakes there.  If you can make them individual
files, please do that, it makes it easier to maintain and code the
kernel for.

thanks,

greg k-h


* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-19  9:01       ` Greg Kroah-Hartman
@ 2019-01-19 16:56         ` Dan Williams
  2019-01-20 16:19           ` Rafael J. Wysocki
  2019-01-20 16:16         ` Rafael J. Wysocki
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Williams @ 2019-01-19 16:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Rafael J. Wysocki, Keith Busch, Linux Kernel Mailing List,
	ACPI Devel Mailing List, Linux Memory Management List,
	Dave Hansen

On Sat, Jan 19, 2019 at 1:01 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Fri, Jan 18, 2019 at 01:08:02PM -0800, Dan Williams wrote:
> > On Thu, Jan 17, 2019 at 3:41 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> > > >
> > > > Add entries for memory initiator and target node class attributes.
> > > >
> > > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > >
> > > I would recommend combining this with the previous patch, as the way
> > > it is now I need to look at two patches at the time. :-)
> > >
> > > > ---
> > > >  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
> > > >  1 file changed, 24 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > > > index 3e90e1f3bf0a..a9c47b4b0eee 100644
> > > > --- a/Documentation/ABI/stable/sysfs-devices-node
> > > > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > > > @@ -90,4 +90,27 @@ Date:                December 2009
> > > >  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > >  Description:
> > > >                 The node's huge page size control/query attributes.
> > > > -               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > \ No newline at end of file
> > > > +               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > +
> > > > +What:          /sys/devices/system/node/nodeX/classY/
> > > > +Date:          December 2018
> > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > +Description:
> > > > +               The node's relationship to other nodes for access class "Y".
> > > > +
> > > > +What:          /sys/devices/system/node/nodeX/classY/initiator_nodelist
> > > > +Date:          December 2018
> > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > +Description:
> > > > +               The node list of memory initiators that have class "Y" access
> > > > +               to this node's memory. CPUs and other memory initiators in
> > > > +               nodes not in the list accessing this node's memory may have
> > > > +               different performance.
> > >
> > > This does not follow the general "one value per file" rule of sysfs (I
> > > know that there are other sysfs files with more than one value in
> > > them, but it is better to follow this rule as long as that makes
> > > sense).
> > >
> > > > +
> > > > +What:          /sys/devices/system/node/nodeX/classY/target_nodelist
> > > > +Date:          December 2018
> > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > +Description:
> > > > +               The node list of memory targets that this initiator node has
> > > > +               class "Y" access. Memory accesses from this node to nodes not
> > > > +               in this list may have different performance.
> > > > --
> > >
> > > Same here.
> > >
> > > And if you follow the recommendation given in the previous message
> > > (add "initiators" and "targets" subdirs under "classX"), you won't
> > > even need the two files above.
> >
> > This recommendation is in conflict with Greg's feedback about kobject
> > usage. If these are just "vanity" subdirs I think it's better to have
> > a multi-value sysfs file. This "list" style is already commonplace for
> > the /sys/devices/system hierarchy.
>
> If you do a subdirectory "correctly" (i.e. a name for an attribute
> group), that's fine.  Just do not ever create a kobject just for a
> subdir, that will mess up userspace.
>
> And I hate the "multi-value" sysfs files, where at all possible, please
> do not copy past bad mistakes there.  If you can make them individual
> files, please do that, it makes it easier to maintain and code the
> kernel for.

I agree in general about multi-value sysfs, but in this case we're
talking about a mask. Masks are a single value. That said I can get on
board with calling what 'cpulist' does a design mistake (human
readable mask), but otherwise switching to one file per item in the
mask is a mess for userspace to consume.


* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-19  9:01       ` Greg Kroah-Hartman
  2019-01-19 16:56         ` Dan Williams
@ 2019-01-20 16:16         ` Rafael J. Wysocki
  2019-01-22 16:36           ` Keith Busch
  1 sibling, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-20 16:16 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Dan Williams, Rafael J. Wysocki, Keith Busch,
	Linux Kernel Mailing List, ACPI Devel Mailing List,
	Linux Memory Management List, Dave Hansen

On Sat, Jan 19, 2019 at 10:01 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Fri, Jan 18, 2019 at 01:08:02PM -0800, Dan Williams wrote:
> > On Thu, Jan 17, 2019 at 3:41 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> > > >
> > > > Add entries for memory initiator and target node class attributes.
> > > >
> > > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > >
> > > I would recommend combining this with the previous patch, as the way
> > > it is now I need to look at two patches at the time. :-)
> > >
> > > > ---
> > > >  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
> > > >  1 file changed, 24 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > > > index 3e90e1f3bf0a..a9c47b4b0eee 100644
> > > > --- a/Documentation/ABI/stable/sysfs-devices-node
> > > > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > > > @@ -90,4 +90,27 @@ Date:                December 2009
> > > >  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > >  Description:
> > > >                 The node's huge page size control/query attributes.
> > > > -               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > \ No newline at end of file
> > > > +               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > +
> > > > +What:          /sys/devices/system/node/nodeX/classY/
> > > > +Date:          December 2018
> > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > +Description:
> > > > +               The node's relationship to other nodes for access class "Y".
> > > > +
> > > > +What:          /sys/devices/system/node/nodeX/classY/initiator_nodelist
> > > > +Date:          December 2018
> > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > +Description:
> > > > +               The node list of memory initiators that have class "Y" access
> > > > +               to this node's memory. CPUs and other memory initiators in
> > > > +               nodes not in the list accessing this node's memory may have
> > > > +               different performance.
> > >
> > > This does not follow the general "one value per file" rule of sysfs (I
> > > know that there are other sysfs files with more than one value in
> > > them, but it is better to follow this rule as long as that makes
> > > sense).
> > >
> > > > +
> > > > +What:          /sys/devices/system/node/nodeX/classY/target_nodelist
> > > > +Date:          December 2018
> > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > +Description:
> > > > +               The node list of memory targets that this initiator node has
> > > > +               class "Y" access. Memory accesses from this node to nodes not
> > > > +               in this list may have different performance.
> > > > --
> > >
> > > Same here.
> > >
> > > And if you follow the recommendation given in the previous message
> > > (add "initiators" and "targets" subdirs under "classX"), you won't
> > > even need the two files above.
> >
> > This recommendation is in conflict with Greg's feedback about kobject
> > usage. If these are just "vanity" subdirs I think it's better to have
> > a multi-value sysfs file. This "list" style is already commonplace for
> > the /sys/devices/system hierarchy.
>
> If you do a subdirectory "correctly" (i.e. a name for an attribute
> group), that's fine.

Yes, that's what I was thinking about: along the lines of the "power"
group under device kobjects.

Cheers,
Rafael


* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-19 16:56         ` Dan Williams
@ 2019-01-20 16:19           ` Rafael J. Wysocki
  2019-01-20 17:34             ` Dan Williams
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-20 16:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Keith Busch,
	Linux Kernel Mailing List, ACPI Devel Mailing List,
	Linux Memory Management List, Dave Hansen

On Sat, Jan 19, 2019 at 5:56 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Sat, Jan 19, 2019 at 1:01 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Fri, Jan 18, 2019 at 01:08:02PM -0800, Dan Williams wrote:
> > > On Thu, Jan 17, 2019 at 3:41 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> > > > >
> > > > > Add entries for memory initiator and target node class attributes.
> > > > >
> > > > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > > >
> > > > I would recommend combining this with the previous patch, as the way
> > > > it is now I need to look at two patches at the time. :-)
> > > >
> > > > > ---
> > > > >  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
> > > > >  1 file changed, 24 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > > > > index 3e90e1f3bf0a..a9c47b4b0eee 100644
> > > > > --- a/Documentation/ABI/stable/sysfs-devices-node
> > > > > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > > > > @@ -90,4 +90,27 @@ Date:                December 2009
> > > > >  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > > >  Description:
> > > > >                 The node's huge page size control/query attributes.
> > > > > -               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > > \ No newline at end of file
> > > > > +               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > > +
> > > > > +What:          /sys/devices/system/node/nodeX/classY/
> > > > > +Date:          December 2018
> > > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > > +Description:
> > > > > +               The node's relationship to other nodes for access class "Y".
> > > > > +
> > > > > +What:          /sys/devices/system/node/nodeX/classY/initiator_nodelist
> > > > > +Date:          December 2018
> > > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > > +Description:
> > > > > +               The node list of memory initiators that have class "Y" access
> > > > > +               to this node's memory. CPUs and other memory initiators in
> > > > > +               nodes not in the list accessing this node's memory may have
> > > > > +               different performance.
> > > >
> > > > This does not follow the general "one value per file" rule of sysfs (I
> > > > know that there are other sysfs files with more than one value in
> > > > them, but it is better to follow this rule as long as that makes
> > > > sense).
> > > >
> > > > > +
> > > > > +What:          /sys/devices/system/node/nodeX/classY/target_nodelist
> > > > > +Date:          December 2018
> > > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > > +Description:
> > > > > +               The node list of memory targets that this initiator node has
> > > > > +               class "Y" access. Memory accesses from this node to nodes not
> > > > > +               in this list may have different performance.
> > > > > --
> > > >
> > > > Same here.
> > > >
> > > > And if you follow the recommendation given in the previous message
> > > > (add "initiators" and "targets" subdirs under "classX"), you won't
> > > > even need the two files above.
> > >
> > > This recommendation is in conflict with Greg's feedback about kobject
> > > usage. If these are just "vanity" subdirs I think it's better to have
> > > a multi-value sysfs file. This "list" style is already commonplace for
> > > the /sys/devices/system hierarchy.
> >
> > If you do a subdirectory "correctly" (i.e. a name for an attribute
> > group), that's fine.  Just do not ever create a kobject just for a
> > subdir, that will mess up userspace.
> >
> > And I hate the "multi-value" sysfs files, where at all possible, please
> > do not copy past bad mistakes there.  If you can make them individual
> > files, please do that, it makes it easier to maintain and code the
> > kernel for.
>
> I agree in general about multi-value sysfs, but in this case we're
> talking about a mask. Masks are a single value. That said I can get on
> board with calling what 'cpulist' does a design mistake (human
> readable mask), but otherwise switching to one file per item in the
> mask is a mess for userspace to consume.

Can you please refer to my response to Keith?

If you have "initiators" and "targets" under "classX" and a list of
symlinks in each of them, I don't see any kind of a mess here.


* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-20 16:19           ` Rafael J. Wysocki
@ 2019-01-20 17:34             ` Dan Williams
  2019-01-21  9:54               ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Dan Williams @ 2019-01-20 17:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg Kroah-Hartman, Keith Busch, Linux Kernel Mailing List,
	ACPI Devel Mailing List, Linux Memory Management List,
	Dave Hansen

On Sun, Jan 20, 2019 at 8:20 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Sat, Jan 19, 2019 at 5:56 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Sat, Jan 19, 2019 at 1:01 AM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Fri, Jan 18, 2019 at 01:08:02PM -0800, Dan Williams wrote:
> > > > On Thu, Jan 17, 2019 at 3:41 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > >
> > > > > On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> > > > > >
> > > > > > Add entries for memory initiator and target node class attributes.
> > > > > >
> > > > > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > > > >
> > > > > I would recommend combining this with the previous patch, as the way
> > > > > it is now I need to look at two patches at the time. :-)
> > > > >
> > > > > > ---
> > > > > >  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
> > > > > >  1 file changed, 24 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > > > > > index 3e90e1f3bf0a..a9c47b4b0eee 100644
> > > > > > --- a/Documentation/ABI/stable/sysfs-devices-node
> > > > > > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > > > > > @@ -90,4 +90,27 @@ Date:                December 2009
> > > > > >  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > > > >  Description:
> > > > > >                 The node's huge page size control/query attributes.
> > > > > > -               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > > > \ No newline at end of file
> > > > > > +               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > > > +
> > > > > > +What:          /sys/devices/system/node/nodeX/classY/
> > > > > > +Date:          December 2018
> > > > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > > > +Description:
> > > > > > +               The node's relationship to other nodes for access class "Y".
> > > > > > +
> > > > > > +What:          /sys/devices/system/node/nodeX/classY/initiator_nodelist
> > > > > > +Date:          December 2018
> > > > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > > > +Description:
> > > > > > +               The node list of memory initiators that have class "Y" access
> > > > > > +               to this node's memory. CPUs and other memory initiators in
> > > > > > +               nodes not in the list accessing this node's memory may have
> > > > > > +               different performance.
> > > > >
> > > > > This does not follow the general "one value per file" rule of sysfs (I
> > > > > know that there are other sysfs files with more than one value in
> > > > > them, but it is better to follow this rule as long as that makes
> > > > > sense).
> > > > >
> > > > > > +
> > > > > > +What:          /sys/devices/system/node/nodeX/classY/target_nodelist
> > > > > > +Date:          December 2018
> > > > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > > > +Description:
> > > > > > +               The node list of memory targets that this initiator node has
> > > > > > +               class "Y" access. Memory accesses from this node to nodes not
> > > > > > +               in this list may have differet performance.
> > > > > > --
> > > > >
> > > > > Same here.
> > > > >
> > > > > And if you follow the recommendation given in the previous message
> > > > > (add "initiators" and "targets" subdirs under "classX"), you won't
> > > > > even need the two files above.
> > > >
> > > > This recommendation is in conflict with Greg's feedback about kobject
> > > > usage. If these are just "vanity" subdirs I think it's better to have
> > > > a multi-value sysfs file. This "list" style is already commonplace for
> > > > the /sys/devices/system hierarchy.
> > >
> > > If you do a subdirectory "correctly" (i.e. a name for an attribute
> > > group), that's fine.  Just do not ever create a kobject just for a
> > > subdir, that will mess up userspace.
> > >
> > > And I hate the "multi-value" sysfs files, where at all possible, please
> > > do not copy past bad mistakes there.  If you can make them individual
> > > files, please do that, it makes it easier to maintain and code the
> > > kernel for.
> >
> > I agree in general about multi-value sysfs, but in this case we're
> > talking about a mask. Masks are a single value. That said I can get on
> > board with calling what 'cpulist' does a design mistake (human
> > readable mask), but otherwise switching to one file per item in the
> > mask is a mess for userspace to consume.
>
> Can you please refer to my response to Keith?

Ah, ok I missed the patch4 comments and was reading this one in
isolation... which also bolsters your comment about squashing these
two patches together.

> If you have "initiators" and "targets" under "classX" and a list of
> symlinks in each of them, I don't see any kind of a mess here.

In this instance, I think having symlinks at all is "messy" vs just
having a mask. Yes, you're right, if we have the proposed symlinks
from patch4 there is no need for these _nodelist attributes, and those
symlinks would be better under "initiator" and "target" directories.
However, I'm arguing going the other way, just have the 2 mask
attributes and no symlinks. The HMAT erodes the concept of "numa
nodes" typically being a single digit number space per platform. Given
increasing numbers of memory target types and initiator devices it's
going to be cumbersome to have userspace walk multiple symlinks vs
just reading a mask and opening the canonical path for a node
directly.

This is also part of the rationale for only emitting one "class"
(initiator / target performance profile) by default. There's an N-to-N
initiator-target description in the HMAT. When / if we decide to emit
more classes, userspace would need to do even more work to convert
directory structures back into data.


* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-20 17:34             ` Dan Williams
@ 2019-01-21  9:54               ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-21  9:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, Keith Busch,
	Linux Kernel Mailing List, ACPI Devel Mailing List,
	Linux Memory Management List, Dave Hansen

On Sun, Jan 20, 2019 at 6:34 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Sun, Jan 20, 2019 at 8:20 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Sat, Jan 19, 2019 at 5:56 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Sat, Jan 19, 2019 at 1:01 AM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Fri, Jan 18, 2019 at 01:08:02PM -0800, Dan Williams wrote:
> > > > > On Thu, Jan 17, 2019 at 3:41 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > >
> > > > > > On Wed, Jan 16, 2019 at 6:59 PM Keith Busch <keith.busch@intel.com> wrote:
> > > > > > >
> > > > > > > Add entries for memory initiator and target node class attributes.
> > > > > > >
> > > > > > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > > > > >
> > > > > > I would recommend combining this with the previous patch, as the way
> > > > > > it is now I need to look at two patches at the time. :-)
> > > > > >
> > > > > > > ---
> > > > > > >  Documentation/ABI/stable/sysfs-devices-node | 25 ++++++++++++++++++++++++-
> > > > > > >  1 file changed, 24 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > > > > > > index 3e90e1f3bf0a..a9c47b4b0eee 100644
> > > > > > > --- a/Documentation/ABI/stable/sysfs-devices-node
> > > > > > > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > > > > > > @@ -90,4 +90,27 @@ Date:                December 2009
> > > > > > >  Contact:       Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > > > > >  Description:
> > > > > > >                 The node's huge page size control/query attributes.
> > > > > > > -               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > > > > \ No newline at end of file
> > > > > > > +               See Documentation/admin-guide/mm/hugetlbpage.rst
> > > > > > > +
> > > > > > > +What:          /sys/devices/system/node/nodeX/classY/
> > > > > > > +Date:          December 2018
> > > > > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > > > > +Description:
> > > > > > > +               The node's relationship to other nodes for access class "Y".
> > > > > > > +
> > > > > > > +What:          /sys/devices/system/node/nodeX/classY/initiator_nodelist
> > > > > > > +Date:          December 2018
> > > > > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > > > > +Description:
> > > > > > > +               The node list of memory initiators that have class "Y" access
> > > > > > > +               to this node's memory. CPUs and other memory initiators in
> > > > > > > +               nodes not in the list accessing this node's memory may have
> > > > > > > +               different performance.
> > > > > >
> > > > > > This does not follow the general "one value per file" rule of sysfs (I
> > > > > > know that there are other sysfs files with more than one value in
> > > > > > them, but it is better to follow this rule as long as that makes
> > > > > > sense).
> > > > > >
> > > > > > > +
> > > > > > > +What:          /sys/devices/system/node/nodeX/classY/target_nodelist
> > > > > > > +Date:          December 2018
> > > > > > > +Contact:       Keith Busch <keith.busch@intel.com>
> > > > > > > +Description:
> > > > > > > +               The node list of memory targets that this initiator node has
> > > > > > > +               class "Y" access. Memory accesses from this node to nodes not
> > > > > > > +               in this list may have differet performance.
> > > > > > > --
> > > > > >
> > > > > > Same here.
> > > > > >
> > > > > > And if you follow the recommendation given in the previous message
> > > > > > (add "initiators" and "targets" subdirs under "classX"), you won't
> > > > > > even need the two files above.
> > > > >
> > > > > This recommendation is in conflict with Greg's feedback about kobject
> > > > > usage. If these are just "vanity" subdirs I think it's better to have
> > > > > a multi-value sysfs file. This "list" style is already commonplace for
> > > > > the /sys/devices/system hierarchy.
> > > >
> > > > If you do a subdirectory "correctly" (i.e. a name for an attribute
> > > > group), that's fine.  Just do not ever create a kobject just for a
> > > > subdir, that will mess up userspace.
> > > >
> > > > And I hate the "multi-value" sysfs files, where at all possible, please
> > > > do not copy past bad mistakes there.  If you can make them individual
> > > > files, please do that, it makes it easier to maintain and code the
> > > > kernel for.
> > >
> > > I agree in general about multi-value sysfs, but in this case we're
> > > talking about a mask. Masks are a single value. That said I can get on
> > > board with calling what 'cpulist' does a design mistake (human
> > > readable mask), but otherwise switching to one file per item in the
> > > mask is a mess for userspace to consume.
> >
> > Can you please refer to my response to Keith?
>
> Ah, ok I missed the patch4 comments and was reading this one in
> isolation... which also bolsters your comment about squashing these
> two patches together.
>
> > If you have "initiators" and "targets" under "classX" and a list of
> > symlinks in each of them, I don't see any kind of a mess here.
>
> In this instance, I think having symlinks at all is "messy" vs just
> having a mask. Yes, you're right, if we have the proposed symlinks
> from patch4 there is no need for these _nodelist attributes, and those
> symlinks would be better under "initiator" and "target" directories.
> However, I'm arguing going the other way, just have the 2 mask
> attributes and no symlinks. The HMAT erodes the concept of "numa
> nodes" typically being a single digit number space per platform. Given
> increasing numbers of memory target types and initiator devices its
> going to be cumbersome to have userspace walk multiple symlinks vs
> just reading a mask and opening the canonical path for a node
> directly.

The symlinks only need to be walked once, however, and after walking
them (once) the user space program can simply create a mask for
itself.

To me, the question here is how to represent a graph in a filesystem,
and since nodes are naturally represented by directories, it is also
natural to represent connections between them as symlinks.
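
For illustration, such a layout might look like this (node numbers
hypothetical):

  /sys/devices/system/node/node0/class0/initiators/node1 -> ../../../node1
  /sys/devices/system/node/node0/class0/targets/node2 -> ../../../node2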

> This is also part of the rationale for only emitting one "class"
> (initiator / target performance profile) by default. There's an N-to-N
> initiator-target description in the HMAT. When / if we decide to emit
> more classes, userspace would need to do correspondingly more work to
> convert directory structures back into data.

I'm not convinced by this argument, as this conversion is a one-off
exercise.  Ultimately, a tool in user space would need to represent a
graph anyway and how it obtains the data to do that doesn't matter too
much to it.

However, IMO there is value in representing the graph in a filesystem
in such a way that, say, a file manager program (without special
modifications) can be used by a human to walk it.

Not to mention the one-value-per-file rule of sysfs, which doesn't need
to be violated if symlinks are used.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-20 16:16         ` Rafael J. Wysocki
@ 2019-01-22 16:36           ` Keith Busch
  2019-01-22 16:51             ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-01-22 16:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg Kroah-Hartman, Dan Williams, Linux Kernel Mailing List,
	ACPI Devel Maling List, Linux Memory Management List,
	Dave Hansen

On Sun, Jan 20, 2019 at 05:16:05PM +0100, Rafael J. Wysocki wrote:
> On Sat, Jan 19, 2019 at 10:01 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > If you do a subdirectory "correctly" (i.e. a name for an attribute
> > group), that's fine.
> 
> Yes, that's what I was thinking about: along the lines of the "power"
> group under device kobjects.

We can't append symlinks to an attribute group, though. I'd need to create
a lot of struct devices just to get the desired directory hierarchy. And
then each of those "devices" would have its own "power" group, which
really doesn't make any sense for what we're trying to show. Is that
really the right way to do this, or is there something else I'm missing?
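
For reference, the "named group" pattern under discussion is just an
attribute group with a non-NULL name, which sysfs presents as a
subdirectory without creating any extra kobject (attribute names here
are illustrative):

        static struct attribute *class0_attrs[] = {
                &dev_attr_read_bandwidth.attr, /* hypothetical attribute */
                NULL,
        };

        static const struct attribute_group class0_group = {
                .name  = "class0",      /* sysfs creates nodeX/class0/ */
                .attrs = class0_attrs,
        };

        /* ... */
        ret = sysfs_create_group(&node->dev.kobj, &class0_group);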

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-22 16:36           ` Keith Busch
@ 2019-01-22 16:51             ` Rafael J. Wysocki
  2019-01-22 16:54               ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-22 16:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, Dan Williams,
	Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Dave Hansen

On Tue, Jan 22, 2019 at 5:37 PM Keith Busch <keith.busch@intel.com> wrote:
>
> On Sun, Jan 20, 2019 at 05:16:05PM +0100, Rafael J. Wysocki wrote:
> > On Sat, Jan 19, 2019 at 10:01 AM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > If you do a subdirectory "correctly" (i.e. a name for an attribute
> > > group), that's fine.
> >
> > Yes, that's what I was thinking about: along the lines of the "power"
> > group under device kobjects.
>
> We can't append symlinks to an attribute group, though.

That's right, unfortunately.

> I'd need to create a lot of struct devices just to get the desired directory hierarchy.

No, you don't need to do that.  Kobjects can be added without
registering a struct device for each of them, along the lines
of what cpufreq does for its policy objects, etc.  See
cpufreq_policy_alloc() and cpufreq_core_init() for examples.
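
A minimal sketch of that pattern (structure and naming illustrative;
needs <linux/kobject.h> and <linux/slab.h>):

        struct class_dir {
                struct kobject kobj;
                /* per-class data would live here */
        };

        static void class_dir_release(struct kobject *kobj)
        {
                kfree(container_of(kobj, struct class_dir, kobj));
        }

        static struct kobj_type class_ktype = {
                .release   = class_dir_release,
                .sysfs_ops = &kobj_sysfs_ops,
        };

        /* ... */
        ret = kobject_init_and_add(&cd->kobj, &class_ktype,
                                   &node->dev.kobj, "class%d", id);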

> And then each of those "devices" would have its own "power" group, which
> really doesn't make any sense for what we're trying to show. Is that
> really the right way to do this, or is there something else I'm missing?

Above?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes
  2019-01-22 16:51             ` Rafael J. Wysocki
@ 2019-01-22 16:54               ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-01-22 16:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, Dan Williams,
	Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Dave Hansen

On Tue, Jan 22, 2019 at 5:51 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Jan 22, 2019 at 5:37 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > On Sun, Jan 20, 2019 at 05:16:05PM +0100, Rafael J. Wysocki wrote:
> > > On Sat, Jan 19, 2019 at 10:01 AM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > If you do a subdirectory "correctly" (i.e. a name for an attribute
> > > > group), that's fine.
> > >
> > > Yes, that's what I was thinking about: along the lines of the "power"
> > > group under device kobjects.
> >
> > We can't append symlinks to an attribute group, though.
>
> That's right, unfortunately.

Scratch this.

You can add them using sysfs_add_link_to_group().  For example, see
what acpi_power_expose_list() does.
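
Roughly (group and link names illustrative):

        ret = sysfs_add_link_to_group(&node->dev.kobj, "class0",
                                      &target->dev.kobj, "node2");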

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 10/13] node: Add memory caching attributes
  2019-01-16 17:58 ` [PATCHv4 10/13] node: Add memory caching attributes Keith Busch
  2019-01-17 16:00   ` Rafael J. Wysocki
@ 2019-02-09  8:20   ` Brice Goglin
  2019-02-10 17:19     ` Jonathan Cameron
  1 sibling, 1 reply; 51+ messages in thread
From: Brice Goglin @ 2019-02-09  8:20 UTC (permalink / raw)
  To: Keith Busch, linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams

Hello Keith

Could we ever have a single side cache in front of two NUMA nodes ? I
don't see a way to find that out in the current implementation. Would we
have an "id" and/or "nodemap" bitmask in the sidecache structure ?

Thanks

Brice



On 16/01/2019 18:58, Keith Busch wrote:
> System memory may have side caches to help improve access speed to
> frequently requested address ranges. While the system provided cache is
> transparent to the software accessing these memory ranges, applications
> can optimize their own access based on cache attributes.
>
> Provide a new API for the kernel to register these memory side caches
> under the memory node that provides it.
>
> The new sysfs representation is modeled on the existing cpu cacheinfo
> attributes and is exported under /sys/devices/system/node/nodeX/side_cache/.
> Unlike CPU cacheinfo, though, the node cache level is reported from
> the view of the memory. A higher number is nearer to the CPU, while
> lower levels are closer to the backing memory. Also unlike CPU cache,
> it is assumed the system will handle flushing any dirty cached memory
> to the last level on a power failure if the range is persistent memory.
>
> The attributes we export are the cache size, the line size, the
> associativity, and the write policy.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/base/node.c  | 142 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/node.h |  39 ++++++++++++++
>  2 files changed, 181 insertions(+)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 1e909f61e8b1..7ff3ed566d7d 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -191,6 +191,146 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
>  		pr_info("failed to add performance attribute group to node %d\n",
>  			nid);
>  }
> +
> +struct node_cache_info {
> +	struct device dev;
> +	struct list_head node;
> +	struct node_cache_attrs cache_attrs;
> +};
> +#define to_cache_info(device) container_of(device, struct node_cache_info, dev)
> +
> +#define CACHE_ATTR(name, fmt) 						\
> +static ssize_t name##_show(struct device *dev,				\
> +			   struct device_attribute *attr,		\
> +			   char *buf)					\
> +{									\
> +	return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\
> +}									\
> +DEVICE_ATTR_RO(name);
> +
> +CACHE_ATTR(size, "%llu")
> +CACHE_ATTR(level, "%u")
> +CACHE_ATTR(line_size, "%u")
> +CACHE_ATTR(associativity, "%u")
> +CACHE_ATTR(write_policy, "%u")
> +
> +static struct attribute *cache_attrs[] = {
> +	&dev_attr_level.attr,
> +	&dev_attr_associativity.attr,
> +	&dev_attr_size.attr,
> +	&dev_attr_line_size.attr,
> +	&dev_attr_write_policy.attr,
> +	NULL,
> +};
> +ATTRIBUTE_GROUPS(cache);
> +
> +static void node_cache_release(struct device *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static void node_cacheinfo_release(struct device *dev)
> +{
> +	struct node_cache_info *info = to_cache_info(dev);
> +	kfree(info);
> +}
> +
> +static void node_init_cache_dev(struct node *node)
> +{
> +	struct device *dev;
> +
> +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +	if (!dev)
> +		return;
> +
> +	dev->parent = &node->dev;
> +	dev->release = node_cache_release;
> +	if (dev_set_name(dev, "side_cache"))
> +		goto free_dev;
> +
> +	if (device_register(dev))
> +		goto free_name;
> +
> +	pm_runtime_no_callbacks(dev);
> +	node->cache_dev = dev;
> +	return;
> +free_name:
> +	kfree_const(dev->kobj.name);
> +free_dev:
> +	kfree(dev);
> +}
> +
> +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
> +{
> +	struct node_cache_info *info;
> +	struct device *dev;
> +	struct node *node;
> +
> +	if (!node_online(nid) || !node_devices[nid])
> +		return;
> +
> +	node = node_devices[nid];
> +	list_for_each_entry(info, &node->cache_attrs, node) {
> +		if (info->cache_attrs.level == cache_attrs->level) {
> +			dev_warn(&node->dev,
> +				"attempt to add duplicate cache level:%d\n",
> +				cache_attrs->level);
> +			return;
> +		}
> +	}
> +
> +	if (!node->cache_dev)
> +		node_init_cache_dev(node);
> +	if (!node->cache_dev)
> +		return;
> +
> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
> +	if (!info)
> +		return;
> +
> +	dev = &info->dev;
> +	dev->parent = node->cache_dev;
> +	dev->release = node_cacheinfo_release;
> +	dev->groups = cache_groups;
> +	if (dev_set_name(dev, "index%d", cache_attrs->level))
> +		goto free_cache;
> +
> +	info->cache_attrs = *cache_attrs;
> +	if (device_register(dev)) {
> +		dev_warn(&node->dev, "failed to add cache level:%d\n",
> +			 cache_attrs->level);
> +		goto free_name;
> +	}
> +	pm_runtime_no_callbacks(dev);
> +	list_add_tail(&info->node, &node->cache_attrs);
> +	return;
> +free_name:
> +	kfree_const(dev->kobj.name);
> +free_cache:
> +	kfree(info);
> +}
> +
> +static void node_remove_caches(struct node *node)
> +{
> +	struct node_cache_info *info, *next;
> +
> +	if (!node->cache_dev)
> +		return;
> +
> +	list_for_each_entry_safe(info, next, &node->cache_attrs, node) {
> +		list_del(&info->node);
> +		device_unregister(&info->dev);
> +	}
> +	device_unregister(node->cache_dev);
> +}
> +
> +static void node_init_caches(unsigned int nid)
> +{
> +	INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
> +}
> +#else
> +static void node_init_caches(unsigned int nid) { }
> +static void node_remove_caches(struct node *node) { }
>  #endif
>  
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
> @@ -475,6 +615,7 @@ void unregister_node(struct node *node)
>  {
>  	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
>  	node_remove_classes(node);
> +	node_remove_caches(node);
>  	device_unregister(&node->dev);
>  }
>  
> @@ -755,6 +896,7 @@ int __register_one_node(int nid)
>  	INIT_LIST_HEAD(&node_devices[nid]->class_list);
>  	/* initialize work queue for memory hot plug */
>  	init_node_hugetlb_work(nid);
> +	node_init_caches(nid);
>  
>  	return error;
>  }
> diff --git a/include/linux/node.h b/include/linux/node.h
> index e22940a593c2..8cdf2b2808e4 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -37,12 +37,47 @@ struct node_hmem_attrs {
>  };
>  void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
>  			 unsigned class);
> +
> +enum cache_associativity {
> +	NODE_CACHE_DIRECT_MAP,
> +	NODE_CACHE_INDEXED,
> +	NODE_CACHE_OTHER,
> +};
> +
> +enum cache_write_policy {
> +	NODE_CACHE_WRITE_BACK,
> +	NODE_CACHE_WRITE_THROUGH,
> +	NODE_CACHE_WRITE_OTHER,
> +};
> +
> +/**
> + * struct node_cache_attrs - system memory caching attributes
> + *
> + * @associativity:	The ways memory blocks may be placed in cache
> + * @write_policy:	Write back or write through policy
> + * @size:		Total size of cache in bytes
> + * @line_size:		Number of bytes fetched on a cache miss
> + * @level:		Represents the cache hierarchy level
> + */
> +struct node_cache_attrs {
> +	enum cache_associativity associativity;
> +	enum cache_write_policy write_policy;
> +	u64 size;
> +	u16 line_size;
> +	u8  level;
> +};
> +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
>  #else
>  static inline void node_set_perf_attrs(unsigned int nid,
>  				       struct node_hmem_attrs *hmem_attrs,
>  				       unsigned class)
>  {
>  }
> +
> +static inline void node_add_cache(unsigned int nid,
> +				  struct node_cache_attrs *cache_attrs)
> +{
> +}
>  #endif
>  
>  struct node {
> @@ -51,6 +86,10 @@ struct node {
>  #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
>  	struct work_struct	node_work;
>  #endif
> +#ifdef CONFIG_HMEM_REPORTING
> +	struct list_head cache_attrs;
> +	struct device *cache_dev;
> +#endif
>  };
>  
>  struct memory_block;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 10/13] node: Add memory caching attributes
  2019-02-09  8:20   ` Brice Goglin
@ 2019-02-10 17:19     ` Jonathan Cameron
  2019-02-11 15:23       ` Keith Busch
  0 siblings, 1 reply; 51+ messages in thread
From: Jonathan Cameron @ 2019-02-10 17:19 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Keith Busch, linux-kernel, linux-acpi, linux-mm,
	Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams

On Sat, 9 Feb 2019 09:20:53 +0100
Brice Goglin <Brice.Goglin@inria.fr> wrote:

> Hello Keith
> 
> Could we ever have a single side cache in front of two NUMA nodes ? I
> don't see a way to find that out in the current implementation. Would we
> have an "id" and/or "nodemap" bitmask in the sidecache structure ?

This is certainly a possible thing for hardware to do.

ACPI IIRC doesn't provide any means of representing that - your best
option is to represent it as two different entries, one for each of the
memory nodes.  Interesting question of whether you would then claim
they were half as big each, or the full size.  Of course, there are
other possible ways to get this info beyond HMAT, so perhaps the interface
should allow it to be exposed if available?

Also, I don't know if it's just me, but calling these "sidecaches" is
downright confusing.  In ACPI at least they are always
specifically referred to as Memory Side Caches.
I'd argue there should even be a hyphen, Memory-Side Caches, the point
being that they are on the memory side of the interconnect
rather than the processor side.  Of course an implementation
choice might be to put them off to the side (as implied by "sidecaches")
in some sense, but it's not the only one.

</terminology rant> :)

Jonathan

> 
> Thanks
> 
> Brice
> 
> 
> 
> On 16/01/2019 18:58, Keith Busch wrote:
> > System memory may have side caches to help improve access speed to
> > frequently requested address ranges. While the system provided cache is
> > transparent to the software accessing these memory ranges, applications
> > can optimize their own access based on cache attributes.
> >
> > Provide a new API for the kernel to register these memory side caches
> > under the memory node that provides it.
> >
> > The new sysfs representation is modeled on the existing cpu cacheinfo
> > attributes and is exported under /sys/devices/system/node/nodeX/side_cache/.
> > Unlike CPU cacheinfo, though, the node cache level is reported from
> > the view of the memory. A higher number is nearer to the CPU, while
> > lower levels are closer to the backing memory. Also unlike CPU cache,
> > it is assumed the system will handle flushing any dirty cached memory
> > to the last level on a power failure if the range is persistent memory.
> >
> > The attributes we export are the cache size, the line size, the
> > associativity, and the write policy.
> >
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  drivers/base/node.c  | 142 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/node.h |  39 ++++++++++++++
> >  2 files changed, 181 insertions(+)
> >
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index 1e909f61e8b1..7ff3ed566d7d 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -191,6 +191,146 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
> >  		pr_info("failed to add performance attribute group to node %d\n",
> >  			nid);
> >  }
> > +
> > +struct node_cache_info {
> > +	struct device dev;
> > +	struct list_head node;
> > +	struct node_cache_attrs cache_attrs;
> > +};
> > +#define to_cache_info(device) container_of(device, struct node_cache_info, dev)
> > +
> > +#define CACHE_ATTR(name, fmt) 						\
> > +static ssize_t name##_show(struct device *dev,				\
> > +			   struct device_attribute *attr,		\
> > +			   char *buf)					\
> > +{									\
> > +	return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\
> > +}									\
> > +DEVICE_ATTR_RO(name);
> > +
> > +CACHE_ATTR(size, "%llu")
> > +CACHE_ATTR(level, "%u")
> > +CACHE_ATTR(line_size, "%u")
> > +CACHE_ATTR(associativity, "%u")
> > +CACHE_ATTR(write_policy, "%u")
> > +
> > +static struct attribute *cache_attrs[] = {
> > +	&dev_attr_level.attr,
> > +	&dev_attr_associativity.attr,
> > +	&dev_attr_size.attr,
> > +	&dev_attr_line_size.attr,
> > +	&dev_attr_write_policy.attr,
> > +	NULL,
> > +};
> > +ATTRIBUTE_GROUPS(cache);
> > +
> > +static void node_cache_release(struct device *dev)
> > +{
> > +	kfree(dev);
> > +}
> > +
> > +static void node_cacheinfo_release(struct device *dev)
> > +{
> > +	struct node_cache_info *info = to_cache_info(dev);
> > +	kfree(info);
> > +}
> > +
> > +static void node_init_cache_dev(struct node *node)
> > +{
> > +	struct device *dev;
> > +
> > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +	if (!dev)
> > +		return;
> > +
> > +	dev->parent = &node->dev;
> > +	dev->release = node_cache_release;
> > +	if (dev_set_name(dev, "side_cache"))
> > +		goto free_dev;
> > +
> > +	if (device_register(dev))
> > +		goto free_name;
> > +
> > +	pm_runtime_no_callbacks(dev);
> > +	node->cache_dev = dev;
> > +	return;
> > +free_name:
> > +	kfree_const(dev->kobj.name);
> > +free_dev:
> > +	kfree(dev);
> > +}
> > +
> > +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
> > +{
> > +	struct node_cache_info *info;
> > +	struct device *dev;
> > +	struct node *node;
> > +
> > +	if (!node_online(nid) || !node_devices[nid])
> > +		return;
> > +
> > +	node = node_devices[nid];
> > +	list_for_each_entry(info, &node->cache_attrs, node) {
> > +		if (info->cache_attrs.level == cache_attrs->level) {
> > +			dev_warn(&node->dev,
> > +				"attempt to add duplicate cache level:%d\n",
> > +				cache_attrs->level);
> > +			return;
> > +		}
> > +	}
> > +
> > +	if (!node->cache_dev)
> > +		node_init_cache_dev(node);
> > +	if (!node->cache_dev)
> > +		return;
> > +
> > +	info = kzalloc(sizeof(*info), GFP_KERNEL);
> > +	if (!info)
> > +		return;
> > +
> > +	dev = &info->dev;
> > +	dev->parent = node->cache_dev;
> > +	dev->release = node_cacheinfo_release;
> > +	dev->groups = cache_groups;
> > +	if (dev_set_name(dev, "index%d", cache_attrs->level))
> > +		goto free_cache;
> > +
> > +	info->cache_attrs = *cache_attrs;
> > +	if (device_register(dev)) {
> > +		dev_warn(&node->dev, "failed to add cache level:%d\n",
> > +			 cache_attrs->level);
> > +		goto free_name;
> > +	}
> > +	pm_runtime_no_callbacks(dev);
> > +	list_add_tail(&info->node, &node->cache_attrs);
> > +	return;
> > +free_name:
> > +	kfree_const(dev->kobj.name);
> > +free_cache:
> > +	kfree(info);
> > +}
> > +
> > +static void node_remove_caches(struct node *node)
> > +{
> > +	struct node_cache_info *info, *next;
> > +
> > +	if (!node->cache_dev)
> > +		return;
> > +
> > +	list_for_each_entry_safe(info, next, &node->cache_attrs, node) {
> > +		list_del(&info->node);
> > +		device_unregister(&info->dev);
> > +	}
> > +	device_unregister(node->cache_dev);
> > +}
> > +
> > +static void node_init_caches(unsigned int nid)
> > +{
> > +	INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
> > +}
> > +#else
> > +static void node_init_caches(unsigned int nid) { }
> > +static void node_remove_caches(struct node *node) { }
> >  #endif
> >  
> >  #define K(x) ((x) << (PAGE_SHIFT - 10))
> > @@ -475,6 +615,7 @@ void unregister_node(struct node *node)
> >  {
> >  	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
> >  	node_remove_classes(node);
> > +	node_remove_caches(node);
> >  	device_unregister(&node->dev);
> >  }
> >  
> > @@ -755,6 +896,7 @@ int __register_one_node(int nid)
> >  	INIT_LIST_HEAD(&node_devices[nid]->class_list);
> >  	/* initialize work queue for memory hot plug */
> >  	init_node_hugetlb_work(nid);
> > +	node_init_caches(nid);
> >  
> >  	return error;
> >  }
> > diff --git a/include/linux/node.h b/include/linux/node.h
> > index e22940a593c2..8cdf2b2808e4 100644
> > --- a/include/linux/node.h
> > +++ b/include/linux/node.h
> > @@ -37,12 +37,47 @@ struct node_hmem_attrs {
> >  };
> >  void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs,
> >  			 unsigned class);
> > +
> > +enum cache_associativity {
> > +	NODE_CACHE_DIRECT_MAP,
> > +	NODE_CACHE_INDEXED,
> > +	NODE_CACHE_OTHER,
> > +};
> > +
> > +enum cache_write_policy {
> > +	NODE_CACHE_WRITE_BACK,
> > +	NODE_CACHE_WRITE_THROUGH,
> > +	NODE_CACHE_WRITE_OTHER,
> > +};
> > +
> > +/**
> > + * struct node_cache_attrs - system memory caching attributes
> > + *
> > + * @associativity:	The ways memory blocks may be placed in cache
> > + * @write_policy:	Write back or write through policy
> > + * @size:		Total size of cache in bytes
> > + * @line_size:		Number of bytes fetched on a cache miss
> > + * @level:		Represents the cache hierarchy level
> > + */
> > +struct node_cache_attrs {
> > +	enum cache_associativity associativity;
> > +	enum cache_write_policy write_policy;
> > +	u64 size;
> > +	u16 line_size;
> > +	u8  level;
> > +};
> > +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
> >  #else
> >  static inline void node_set_perf_attrs(unsigned int nid,
> >  				       struct node_hmem_attrs *hmem_attrs,
> >  				       unsigned class)
> >  {
> >  }
> > +
> > +static inline void node_add_cache(unsigned int nid,
> > +				  struct node_cache_attrs *cache_attrs)
> > +{
> > +}
> >  #endif
> >  
> >  struct node {
> > @@ -51,6 +86,10 @@ struct node {
> >  #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
> >  	struct work_struct	node_work;
> >  #endif
> > +#ifdef CONFIG_HMEM_REPORTING
> > +	struct list_head cache_attrs;
> > +	struct device *cache_dev;
> > +#endif
> >  };
> >  
> >  struct memory_block;  
> 



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 10/13] node: Add memory caching attributes
  2019-02-10 17:19     ` Jonathan Cameron
@ 2019-02-11 15:23       ` Keith Busch
  2019-02-12  8:11         ` Brice Goglin
  2019-02-12  8:49         ` Jonathan Cameron
  0 siblings, 2 replies; 51+ messages in thread
From: Keith Busch @ 2019-02-11 15:23 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Brice Goglin, linux-kernel, linux-acpi, linux-mm,
	Greg Kroah-Hartman, Rafael Wysocki, Hansen, Dave, Williams,
	Dan J

On Sun, Feb 10, 2019 at 09:19:58AM -0800, Jonathan Cameron wrote:
> On Sat, 9 Feb 2019 09:20:53 +0100
> Brice Goglin <Brice.Goglin@inria.fr> wrote:
> 
> > Hello Keith
> > 
> > Could we ever have a single side cache in front of two NUMA nodes ? I
> > don't see a way to find that out in the current implementation. Would we
> > have an "id" and/or "nodemap" bitmask in the sidecache structure ?
> 
> This is certainly a possible thing for hardware to do.
>
> ACPI IIRC doesn't provide any means of representing that - your best
> option is to represent it as two different entries, one for each of the
> memory nodes.  Interesting question of whether you would then claim
> they were half as big each, or the full size.  Of course, there are
> other possible ways to get this info beyond HMAT, so perhaps the interface
> should allow it to be exposed if available?

HMAT doesn't do this, but I want this interface abstracted enough from
HMAT to express whatever is necessary.

The CPU cache attributes are the closest existing exported analogue to
this, and they provide "shared_cpu_list". To that end, I can export a
"shared_node_list", though previous reviews strongly disliked multi-value
sysfs entries. :(

Would shared-node symlinks capture the need, and be more acceptable?
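
For comparison, the existing CPU form and a node analogue would read
like this (values hypothetical, and the shared_node_list path is only a
sketch, not part of this series):

  $ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
  0-7
  $ cat /sys/devices/system/node/node0/side_cache/index1/shared_node_list
  0-1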
 
> Also, I don't know if it's just me, but calling these "sidecaches" is
> downright confusing.  In ACPI at least they are always
> specifically referred to as Memory Side Caches.
> I'd argue there should even be a hyphen, Memory-Side Caches, the point
> being that they are on the memory side of the interconnect
> rather than the processor side.  Of course an implementation
> choice might be to put them off to the side (as implied by "sidecaches")
> in some sense, but it's not the only one.
> 
> </terminology rant> :)

Now that you mention it, I agree "side" is ambiguous.  Maybe call it
"numa_cache" or "node_cache"?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 10/13] node: Add memory caching attributes
  2019-02-11 15:23       ` Keith Busch
@ 2019-02-12  8:11         ` Brice Goglin
  2019-02-12  8:49         ` Jonathan Cameron
  1 sibling, 0 replies; 51+ messages in thread
From: Brice Goglin @ 2019-02-12  8:11 UTC (permalink / raw)
  To: Keith Busch, Jonathan Cameron
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Hansen, Dave, Williams, Dan J


On 11/02/2019 16:23, Keith Busch wrote:
> On Sun, Feb 10, 2019 at 09:19:58AM -0800, Jonathan Cameron wrote:
>> On Sat, 9 Feb 2019 09:20:53 +0100
>> Brice Goglin <Brice.Goglin@inria.fr> wrote:
>>
>>> Hello Keith
>>>
>>> Could we ever have a single side cache in front of two NUMA nodes ? I
>>> don't see a way to find that out in the current implementation. Would we
>>> have an "id" and/or "nodemap" bitmask in the sidecache structure ?
>> This is certainly a possible thing for hardware to do.
>>
>> ACPI IIRC doesn't provide any means of representing that - your best
>> option is to represent it as two different entries, one for each of the
>> memory nodes.  Interesting question of whether you would then claim
>> they were half as big each, or the full size.  Of course, there are
>> other possible ways to get this info beyond HMAT, so perhaps the interface
>> should allow it to be exposed if available?
> HMAT doesn't do this, but I want this interface abstracted enough from
> HMAT to express whatever is necessary.
>
> The CPU cache attributes are the closest existing exported analogue to
> this, and they provide "shared_cpu_list". To that end, I can export a
> "shared_node_list", though previous reviews strongly disliked multi-value
> sysfs entries. :(
>
> Would shared-node symlinks capture the need, and be more acceptable?


As a user-space guy reading these files/symlinks, I would prefer reading
a bitmask just like we do for CPU cache "cpumap" or CPU "siblings" files
(or sibling_list).

Reading a directory and looking for dentries matching "foo%d" is far
less convenient in C. If all these files are inside a dedicated
subdirectory, it's better, but still not as easy.
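
For instance, even the simple scan ends up as something like this in C
(the directory path and the 64-node limit are just for illustration):

        #include <dirent.h>
        #include <stdio.h>

        /* build a node mask from "nodeN" dentries under the given directory */
        static unsigned long nodes_from_dir(const char *path)
        {
                unsigned long mask = 0;
                unsigned int nid;
                struct dirent *d;
                DIR *dir = opendir(path);

                if (!dir)
                        return 0;
                while ((d = readdir(dir)) != NULL)
                        if (sscanf(d->d_name, "node%u", &nid) == 1 && nid < 64)
                                mask |= 1UL << nid;
                closedir(dir);
                return mask;
        }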

Brice



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 10/13] node: Add memory caching attributes
  2019-02-11 15:23       ` Keith Busch
  2019-02-12  8:11         ` Brice Goglin
@ 2019-02-12  8:49         ` Jonathan Cameron
  2019-02-12 17:31           ` Keith Busch
  1 sibling, 1 reply; 51+ messages in thread
From: Jonathan Cameron @ 2019-02-12  8:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: Brice Goglin, linux-kernel, linux-acpi, linux-mm,
	Greg Kroah-Hartman, Rafael Wysocki, Hansen, Dave, Williams,
	Dan J

On Mon, 11 Feb 2019 08:23:04 -0700
Keith Busch <keith.busch@intel.com> wrote:

> On Sun, Feb 10, 2019 at 09:19:58AM -0800, Jonathan Cameron wrote:
> > On Sat, 9 Feb 2019 09:20:53 +0100
> > Brice Goglin <Brice.Goglin@inria.fr> wrote:
> >   
> > > Hello Keith
> > > 
> > > Could we ever have a single side cache in front of two NUMA nodes ? I
> > > don't see a way to find that out in the current implementation. Would we
> > > have an "id" and/or "nodemap" bitmask in the sidecache structure ?  
> > 
> > This is certainly a possible thing for hardware to do.
> >
> > ACPI IIRC doesn't provide any means of representing that - your best
> > option is to represent it as two different entries, one for each of the
> > memory nodes.  Interesting question of whether you would then claim
> > they were half as big each, or the full size.  Of course, there are
> > other possible ways to get this info beyond HMAT, so perhaps the interface
> > should allow it to be exposed if available?  
> 
> HMAT doesn't do this, but I want this interface abstracted enough from
> HMAT to express whatever is necessary.
> 
> The CPU cache attributes are the closest existing exported analogue to
> this, and they provide "shared_cpu_list". To that end, I can export a
> "shared_node_list", though previous reviews strongly disliked multi-value
> sysfs entries. :(
> 
> Would shared-node symlinks capture the need, and be more acceptable?

My inclination is that it's better to follow an existing pattern than
invent a new one that breaks people's expectations.

However, I don't feel that strongly about it as long as the interface
is functional and intuitive.


>  
> > Also, I don't know if it's just me, but calling these "sidecaches" is
> > downright confusing.  In ACPI at least they are always
> > specifically referred to as Memory Side Caches.
> > I'd argue there should even be a hyphen, Memory-Side Caches, the point
> > being that they are on the memory side of the interconnect
> > rather than the processor side.  Of course an implementation
> > choice might be to put them off to the side (as implied by "sidecaches")
> > in some sense, but it's not the only one.
> > 
> > </terminology rant> :)  
> 
> Now that you mention it, I agree "side" is ambiguous.  Maybe call it
> "numa_cache" or "node_cache"?
I'm not sure any of the options work well.  My inclination would be to
use the full name and keep the somewhat redundant "memory" in there.

The other two feel like they could just as easily be coherent caches
at accelerators, for example...

memory_side_cache?

The fun of naming ;)

Jonathan


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHv4 10/13] node: Add memory caching attributes
  2019-02-12  8:49         ` Jonathan Cameron
@ 2019-02-12 17:31           ` Keith Busch
  0 siblings, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-02-12 17:31 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Brice Goglin, linux-kernel, linux-acpi, linux-mm,
	Greg Kroah-Hartman, Rafael Wysocki, Hansen, Dave, Williams,
	Dan J

On Tue, Feb 12, 2019 at 08:49:03AM +0000, Jonathan Cameron wrote:
> On Mon, 11 Feb 2019 08:23:04 -0700
> Keith Busch <keith.busch@intel.com> wrote:
> 
> > On Sun, Feb 10, 2019 at 09:19:58AM -0800, Jonathan Cameron wrote:
> > > On Sat, 9 Feb 2019 09:20:53 +0100
> > > Brice Goglin <Brice.Goglin@inria.fr> wrote:
> > >   
> > > > Hello Keith
> > > > 
> > > > Could we ever have a single side cache in front of two NUMA nodes ? I
> > > > don't see a way to find that out in the current implementation. Would we
> > > > have an "id" and/or "nodemap" bitmask in the sidecache structure ?  
> > > 
> > > This is certainly a possible thing for hardware to do.
> > >
> > > ACPI IIRC doesn't provide any means of representing that - your best
> > > option is to represent it as two different entries, one for each of the
> > > memory nodes.  Interesting question of whether you would then claim
> > > they were half as big each, or the full size.  Of course, there are
> > > other possible ways to get this info beyond HMAT, so perhaps the interface
> > > should allow it to be exposed if available?  
> > 
> > HMAT doesn't do this, but I want this interface abstracted enough from
> > HMAT to express whatever is necessary.
> > 
> > The CPU cache attributes are the closest existing exported analogue to
> > this, and they provide "shared_cpu_list". To that end, I can export a
> > "shared_node_list", though previous reviews strongly disliked multi-value
> > sysfs entries. :(
> > 
> > Would shared-node symlinks capture the need, and be more acceptable?
> 
> My inclination is that it's better to follow an existing pattern than
> invent a new one that breaks people's expectations.
> 
> However, I don't feel that strongly about it as long as the interface
> is functional and intuitive.

Okay, considering I'd have a difficult time testing such an interface
since it doesn't apply to HMAT, and I've received only conflicting
feedback on list attributes, I would prefer to leave this feature out
of this series for now. I'm certainly not against adding it later.

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2019-02-12 17:31 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-16 17:57 [PATCHv4 00/13] Heterogeneous memory node attributes Keith Busch
2019-01-16 17:57 ` [PATCHv4 01/13] acpi: Create subtable parsing infrastructure Keith Busch
2019-01-16 17:57 ` [PATCHv4 02/13] acpi: Add HMAT to generic parsing tables Keith Busch
2019-01-16 17:57 ` [PATCHv4 03/13] acpi/hmat: Parse and report heterogeneous memory Keith Busch
2019-01-17 11:00   ` Rafael J. Wysocki
2019-01-16 17:57 ` [PATCHv4 04/13] node: Link memory nodes to their compute nodes Keith Busch
2019-01-17 11:26   ` Rafael J. Wysocki
2019-01-16 17:57 ` [PATCHv4 05/13] Documentation/ABI: Add new node sysfs attributes Keith Busch
2019-01-17 11:41   ` Rafael J. Wysocki
2019-01-18 20:42     ` Keith Busch
2019-01-18 21:08     ` Dan Williams
2019-01-19  9:01       ` Greg Kroah-Hartman
2019-01-19 16:56         ` Dan Williams
2019-01-20 16:19           ` Rafael J. Wysocki
2019-01-20 17:34             ` Dan Williams
2019-01-21  9:54               ` Rafael J. Wysocki
2019-01-20 16:16         ` Rafael J. Wysocki
2019-01-22 16:36           ` Keith Busch
2019-01-22 16:51             ` Rafael J. Wysocki
2019-01-22 16:54               ` Rafael J. Wysocki
2019-01-18 11:21   ` Jonathan Cameron
2019-01-18 16:35     ` Dan Williams
2019-01-16 17:57 ` [PATCHv4 06/13] acpi/hmat: Register processor domain to its memory Keith Busch
2019-01-17 12:11   ` Rafael J. Wysocki
2019-01-17 17:01     ` Dan Williams
2019-01-16 17:57 ` [PATCHv4 07/13] node: Add heterogenous memory access attributes Keith Busch
2019-01-17 15:03   ` Rafael J. Wysocki
2019-01-17 15:41     ` Greg Kroah-Hartman
2019-01-16 17:57 ` [PATCHv4 08/13] Documentation/ABI: Add node performance attributes Keith Busch
2019-01-17 15:09   ` Rafael J. Wysocki
2019-01-16 17:58 ` [PATCHv4 09/13] acpi/hmat: Register " Keith Busch
2019-01-17 15:21   ` Rafael J. Wysocki
2019-01-16 17:58 ` [PATCHv4 10/13] node: Add memory caching attributes Keith Busch
2019-01-17 16:00   ` Rafael J. Wysocki
2019-02-09  8:20   ` Brice Goglin
2019-02-10 17:19     ` Jonathan Cameron
2019-02-11 15:23       ` Keith Busch
2019-02-12  8:11         ` Brice Goglin
2019-02-12  8:49         ` Jonathan Cameron
2019-02-12 17:31           ` Keith Busch
2019-01-16 17:58 ` [PATCHv4 11/13] Documentation/ABI: Add node cache attributes Keith Busch
2019-01-17 16:25   ` Rafael J. Wysocki
2019-01-16 17:58 ` [PATCHv4 12/13] acpi/hmat: Register memory side " Keith Busch
2019-01-17 17:42   ` Rafael J. Wysocki
2019-01-16 17:58 ` [PATCHv4 13/13] doc/mm: New documentation for memory performance Keith Busch
2019-01-17 12:58 ` [PATCHv4 00/13] Heterogeneous memory node attributes Balbir Singh
2019-01-17 15:44   ` Keith Busch
2019-01-18 13:16     ` Balbir Singh
2019-01-17 18:18 ` Jonathan Cameron
2019-01-17 19:47   ` Keith Busch
2019-01-18 11:12     ` Jonathan Cameron
