* [PATCH 1/7] node: Link memory nodes to their compute nodes
@ 2018-11-14 22:49 Keith Busch
  2018-11-14 22:49 ` [PATCH 2/7] node: Add heterogeneous memory performance Keith Busch
                   ` (7 more replies)
  0 siblings, 8 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-14 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Memory-only nodes will often have affinity to a compute node, and
platforms have ways to express that locality relationship.

A node containing CPUs or other DMA devices that can initiate memory
access is referred to as a "memory initiator". A "memory target" is a
node that provides at least one physical address range accessible to a
memory initiator.

In preparation for these systems, provide a new kernel API to link
the target memory node to its initiator compute node with symlinks to
each other.

The following example shows the new sysfs hierarchy setup for memory node
'Y' local to compute node 'X':

  # ls -l /sys/devices/system/node/nodeX/target*
  /sys/devices/system/node/nodeX/targetY -> ../nodeY

  # ls -l /sys/devices/system/node/nodeY/initiator*
  /sys/devices/system/node/nodeY/initiatorX -> ../nodeX
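
A hypothetical caller that knows memory-only node 1 is local to compute
node 0 would register the relationship as follows (the node numbers are
illustrative only):

  ret = register_memory_node_under_compute_node(1, 0);
  if (ret)
          pr_warn("failed to link memory node 1 to node 0: %d\n", ret);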

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/node.c  | 32 ++++++++++++++++++++++++++++++++
 include/linux/node.h |  2 ++
 2 files changed, 34 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd92ce3d..a9b7512a9502 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -372,6 +372,38 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid)
 				 kobject_name(&node_devices[nid]->dev.kobj));
 }
 
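+/*
+ * Create cross-referencing "target<m>" and "initiator<p>" symlinks
+ * between memory target node 'm' and one of its local compute
+ * (initiator) nodes 'p'.
+ */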
+int register_memory_node_under_compute_node(unsigned int m, unsigned int p)
+{
+	int ret;
+	char initiator[20], target[17];
+
+	if (!node_online(p) || !node_online(m))
+		return -ENODEV;
+	if (m == p)
+		return 0;
+
+	snprintf(initiator, sizeof(initiator), "initiator%d", p);
+	snprintf(target, sizeof(target), "target%d", m);
+
+	ret = sysfs_create_link(&node_devices[p]->dev.kobj,
+				&node_devices[m]->dev.kobj,
+				target);
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_link(&node_devices[m]->dev.kobj,
+				&node_devices[p]->dev.kobj,
+				initiator);
+	if (ret)
+		goto err;
+
+	return 0;
+ err:
+	sysfs_remove_link(&node_devices[p]->dev.kobj,
+			  target);
+	return ret;
+}
+
 int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 {
 	struct device *obj;
diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..1fd734a3fb3f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -75,6 +75,8 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 					   unsigned long phys_index);
 
+extern int register_memory_node_under_compute_node(unsigned int m, unsigned int p);
+
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
 					 node_registration_func_t unregister);
-- 
2.14.4



* [PATCH 2/7] node: Add heterogeneous memory performance
  2018-11-14 22:49 [PATCH 1/7] node: Link memory nodes to their compute nodes Keith Busch
@ 2018-11-14 22:49 ` Keith Busch
  2018-11-19  3:35   ` Anshuman Khandual
  2018-11-27  7:00   ` Dan Williams
  2018-11-14 22:49 ` [PATCH 3/7] doc/vm: New documentation for " Keith Busch
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-14 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Heterogeneous memory systems provide memory nodes with latency
and bandwidth performance attributes that are different from other
nodes. Create an interface for the kernel to register these attributes
under the node that provides the memory. If the system provides this
information, applications can query the node attributes when deciding
from which node to request memory.

When multiple memory initiators exist, they may not all perform the
same when accessing a given memory target. The highest performing
initiator to a given target is considered to be a local initiator for
that target. The kernel provides performance attributes only for the
local initiators.

The memory's compute node should be symlinked in sysfs as one of the
node's initiators.

The following example shows the new sysfs hierarchy for a node exporting
performance attributes:

  # tree /sys/devices/system/node/nodeY/initiator_access
  /sys/devices/system/node/nodeY/initiator_access
  |-- read_bandwidth
  |-- read_latency
  |-- write_bandwidth
  `-- write_latency

The bandwidth is exported as MB/s and latency is reported in nanoseconds.
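
For example, a platform driver could register a node 'nid's attributes
as follows; a minimal sketch, with illustrative values only:

  struct node_hmem_attrs hmem_attrs = {
          .read_bandwidth  = 5000, /* MB/s */
          .write_bandwidth = 2500, /* MB/s */
          .read_latency    = 100,  /* nanoseconds */
          .write_latency   = 150,  /* nanoseconds */
  };

  node_set_perf_attrs(nid, &hmem_attrs);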

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/Kconfig |  8 ++++++++
 drivers/base/node.c  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/node.h | 22 ++++++++++++++++++++++
 3 files changed, 74 insertions(+)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index ae213ed2a7c8..2cf67c80046d 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE
 	  unusable. You should say N here unless you are explicitly looking to
 	  test this functionality.
 
+config HMEM
+	bool
+	default y
+	depends on NUMA
+	help
+	  Enable reporting for heterogeneous memory access attributes under
+	  their non-uniform memory nodes.
+
 source "drivers/base/test/Kconfig"
 
 config SYS_HYPERVISOR
diff --git a/drivers/base/node.c b/drivers/base/node.c
index a9b7512a9502..232535761998 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -59,6 +59,50 @@ static inline ssize_t node_read_cpulist(struct device *dev,
 static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
+#ifdef CONFIG_HMEM
+const struct attribute_group node_access_attrs_group;
+
+#define ACCESS_ATTR(name) 						\
+static ssize_t name##_show(struct device *dev,				\
+			   struct device_attribute *attr,		\
+			   char *buf)					\
+{									\
+	return sprintf(buf, "%u\n", to_node(dev)->hmem_attrs.name);	\
+}									\
+static DEVICE_ATTR_RO(name);
+
+ACCESS_ATTR(read_bandwidth)
+ACCESS_ATTR(read_latency)
+ACCESS_ATTR(write_bandwidth)
+ACCESS_ATTR(write_latency)
+
+static struct attribute *access_attrs[] = {
+	&dev_attr_read_bandwidth.attr,
+	&dev_attr_read_latency.attr,
+	&dev_attr_write_bandwidth.attr,
+	&dev_attr_write_latency.attr,
+	NULL,
+};
+
+const struct attribute_group node_access_attrs_group = {
+	.name		= "initiator_access",
+	.attrs		= access_attrs,
+};
+
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs)
+{
+	struct node *node;
+
+	if (WARN_ON_ONCE(!node_online(nid)))
+		return;
+	node = node_devices[nid];
+	node->hmem_attrs = *hmem_attrs;
+	if (sysfs_create_group(&node->dev.kobj, &node_access_attrs_group))
+		pr_info("failed to add performance attribute group to node %d\n",
+			nid);
+}
+#endif
+
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
diff --git a/include/linux/node.h b/include/linux/node.h
index 1fd734a3fb3f..6a1aa6a153f8 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -17,14 +17,36 @@
 
 #include <linux/device.h>
 #include <linux/cpumask.h>
+#include <linux/list.h>
 #include <linux/workqueue.h>
 
+#ifdef CONFIG_HMEM
+/**
+ * struct node_hmem_attrs - heterogeneous memory performance attributes
+ *
+ * @read_bandwidth:	Read bandwidth in MB/s
+ * @write_bandwidth:	Write bandwidth in MB/s
+ * @read_latency:	Read latency in nanoseconds
+ * @write_latency:	Write latency in nanoseconds
+ */
+struct node_hmem_attrs {
+	unsigned int read_bandwidth;
+	unsigned int write_bandwidth;
+	unsigned int read_latency;
+	unsigned int write_latency;
+};
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs);
+#endif
+
 struct node {
 	struct device	dev;
 
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
 	struct work_struct	node_work;
 #endif
+#ifdef CONFIG_HMEM
+	struct node_hmem_attrs hmem_attrs;
+#endif
 };
 
 struct memory_block;
-- 
2.14.4



* [PATCH 3/7] doc/vm: New documentation for memory performance
  2018-11-14 22:49 [PATCH 1/7] node: Link memory nodes to their compute nodes Keith Busch
  2018-11-14 22:49 ` [PATCH 2/7] node: Add heterogeneous memory performance Keith Busch
@ 2018-11-14 22:49 ` Keith Busch
  2018-11-15 12:59   ` Jonathan Cameron
  2018-11-20 13:51   ` Mike Rapoport
  2018-11-14 22:49 ` [PATCH 4/7] node: Add memory caching attributes Keith Busch
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-14 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Platforms may provide system memory where some physical address ranges
perform differently than others. These heterogeneous memory attributes are
common to the node that provides the memory and are exported by the kernel.

Add new documentation providing a brief overview of such systems and
the attributes the kernel makes available to aid applications wishing
to query this information.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/vm/numaperf.rst | 71 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 Documentation/vm/numaperf.rst

diff --git a/Documentation/vm/numaperf.rst b/Documentation/vm/numaperf.rst
new file mode 100644
index 000000000000..5a3ecaff5474
--- /dev/null
+++ b/Documentation/vm/numaperf.rst
@@ -0,0 +1,71 @@
+.. _numaperf:
+
+================
+NUMA Performance
+================
+
+Some platforms may have multiple types of memory attached to a single
+CPU. These disparate memory ranges share some characteristics, such as
+CPU cache coherence, but may have different performance. For example,
+different media types and buses affect bandwidth and latency.
+
+A system supporting such heterogeneous memory groups each memory type
+under different "nodes" based on similar CPU locality and performance
+characteristics.  Some memory may share the same node as a CPU, and
+others are provided as memory-only nodes. While memory-only nodes do not
+provide CPUs, they may still be local to one or more compute nodes. The
+following diagram shows one such example of two compute nodes with local
+memory and a memory-only node for each compute node:
+
+ +------------------+     +------------------+
+ | Compute Node 0   +-----+ Compute Node 1   |
+ | Local Node0 Mem  |     | Local Node1 Mem  |
+ +--------+---------+     +--------+---------+
+          |                        |
+ +--------+---------+     +--------+---------+
+ | Slower Node2 Mem |     | Slower Node3 Mem |
+ +------------------+     +--------+---------+
+
+A "memory initiator" is a node containing one or more devices such as
+CPUs or separate memory I/O devices that can initiate memory requests. A
+"memory target" is a node containing one or more CPU-accessible physical
+address ranges.
+
+When multiple memory initiators exist, they may not all show the same
+performance when accessing a given memory target. The highest performing
+initiator to a given target is considered to be one of that target's
+local initiators.
+
+To aid applications matching memory targets with their initiators, the
+kernel provides symlinks between them, as in the following example::
+
+	# ls -l /sys/devices/system/node/nodeX/target*
+	/sys/devices/system/node/nodeX/targetY -> ../nodeY
+
+	# ls -l /sys/devices/system/node/nodeY/initiator*
+	/sys/devices/system/node/nodeY/initiatorX -> ../nodeX
+
+Applications may wish to consider which node they want their memory to
+be allocated from based on the node's performance characteristics. If
+the system provides these attributes, the kernel exports them under the
+node sysfs hierarchy by appending the initiator_access directory under
+the node as follows::
+
+	/sys/devices/system/node/nodeY/initiator_access/
+
+The kernel does not provide performance attributes for non-local memory
+initiators. The performance characteristics the kernel provides for
+the local initiators are exported as follows::
+
+	# tree /sys/devices/system/node/nodeY/initiator_access
+	/sys/devices/system/node/nodeY/initiator_access
+	|-- read_bandwidth
+	|-- read_latency
+	|-- write_bandwidth
+	`-- write_latency
+
+The bandwidth attributes are provided in MB/s.
+
+The latency attributes are provided in nanoseconds.
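+
+A minimal userspace sketch for reading one of these attributes; the
+node number and printed values are illustrative only::
+
+	#include <stdio.h>
+
+	int main(void)
+	{
+		unsigned int mbps;
+		FILE *f;
+
+		f = fopen("/sys/devices/system/node/node1/initiator_access/read_bandwidth", "r");
+		if (!f)
+			return 1; /* node absent or attributes not provided */
+		if (fscanf(f, "%u", &mbps) == 1)
+			printf("node1 read bandwidth: %u MB/s\n", mbps);
+		fclose(f);
+		return 0;
+	}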
+
+See also: https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
-- 
2.14.4



* [PATCH 4/7] node: Add memory caching attributes
  2018-11-14 22:49 [PATCH 1/7] node: Link memory nodes to their compute nodes Keith Busch
  2018-11-14 22:49 ` [PATCH 2/7] node: Add heterogeneous memory performance Keith Busch
  2018-11-14 22:49 ` [PATCH 3/7] doc/vm: New documentation for " Keith Busch
@ 2018-11-14 22:49 ` Keith Busch
  2018-11-15  0:40   ` Dave Hansen
                     ` (3 more replies)
  2018-11-14 22:49 ` [PATCH 5/7] doc/vm: New documentation for memory cache Keith Busch
                   ` (4 subsequent siblings)
  7 siblings, 4 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-14 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

System memory may have side caches to help improve access speed. While
the system provided cache is transparent to the software accessing
these memory ranges, applications can optimize their own access based
on cache attributes.

In preparation for such systems, provide a new API for the kernel to
register these memory side caches under the memory node that provides them.

The kernel's sysfs representation is modeled from the cpu cacheinfo
attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
cacheinfo, though, a higher numbered memory cache level is nearer to the
CPU, while lower levels are closer to the backing memory. Also unlike
CPU caches, the system handles flushing any dirty cached memory to the
last level memory on a power failure if the range is persistent.

The exported attributes are the cache size, the line size, associativity,
and write back policy.
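
For example, registering a direct-mapped, write-back, 2GB level 1 memory
side cache with 64 byte cache lines might look like the following sketch
(the values are illustrative only):

  struct node_cache_attrs cache_attrs = {
          .level         = 1,
          .associativity = NODE_CACHE_DIRECT_MAP,
          .write_policy  = NODE_CACHE_WRITE_BACK,
          .size          = 2ULL << 30, /* bytes */
          .line_size     = 64,         /* bytes */
  };

  node_add_cache(nid, &cache_attrs);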

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/base/node.c  | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/node.h |  23 ++++++++++
 2 files changed, 140 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 232535761998..bb94f1d18115 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -60,6 +60,12 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
 #ifdef CONFIG_HMEM
+struct node_cache_obj {
+	struct kobject kobj;
+	struct list_head node;
+	struct node_cache_attrs cache_attrs;
+};
+
 const struct attribute_group node_access_attrs_group;
 
 #define ACCESS_ATTR(name) 						\
@@ -101,6 +107,115 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs)
 		pr_info("failed to add performance attribute group to node %d\n",
 			nid);
 }
+
+struct cache_attribute_entry {
+	struct attribute attr;
+	ssize_t (*show)(struct node_cache_attrs *, char *);
+};
+
+#define CACHE_ATTR(name, fmt) 						\
+static ssize_t name##_show(struct node_cache_attrs *cache,		\
+			   char *buf)					\
+{									\
+	return sprintf(buf, fmt "\n", cache->name);			\
+}									\
+static struct cache_attribute_entry cache_attr_##name = __ATTR_RO(name);
+
+CACHE_ATTR(size, "%llu")
+CACHE_ATTR(level, "%d")
+CACHE_ATTR(line_size, "%d")
+CACHE_ATTR(associativity, "%d")
+CACHE_ATTR(write_policy, "%d")
+
+static struct attribute *cache_attrs[] = {
+	&cache_attr_level.attr,
+	&cache_attr_associativity.attr,
+	&cache_attr_size.attr,
+	&cache_attr_line_size.attr,
+	&cache_attr_write_policy.attr,
+	NULL,
+};
+
+static ssize_t cache_attr_show(struct kobject *kobj, struct attribute *attr,
+			       char *page)
+{
+	struct cache_attribute_entry *entry =
+			container_of(attr, struct cache_attribute_entry, attr);
+	struct node_cache_obj *cache_obj =
+			container_of(kobj, struct node_cache_obj, kobj);
+	return entry->show(&cache_obj->cache_attrs, page);
+}
+
+static const struct sysfs_ops cache_ops = {
+	.show	= &cache_attr_show,
+};
+
+static void node_cache_release(struct kobject *kobj)
+{
+	struct node_cache_obj *cache_obj =
+			container_of(kobj, struct node_cache_obj, kobj);
+
+	kfree(cache_obj);
+}
+
+static struct kobj_type cache_ktype = {
+	.default_attrs	= cache_attrs,
+	.sysfs_ops	= &cache_ops,
+	.release	= node_cache_release,
+};
+
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
+{
+	struct node_cache_obj *cache_obj;
+	struct node *node;
+
+	if (!node_online(nid) || !node_devices[nid])
+		return;
+
+	node = node_devices[nid];
+	list_for_each_entry(cache_obj, &node->cache_attrs, node) {
+		if (cache_obj->cache_attrs.level == cache_attrs->level) {
+			dev_warn(&node->dev,
+				"attempt to add duplicate cache level:%d\n",
+				cache_attrs->level);
+			return;
+		}
+	}
+
+	if (!node->cache_kobj)
+		node->cache_kobj = kobject_create_and_add("cache",
+							  &node->dev.kobj);
+	if (!node->cache_kobj)
+		return;
+
+	cache_obj = kzalloc(sizeof(*cache_obj), GFP_KERNEL);
+	if (!cache_obj)
+		return;
+
+	cache_obj->cache_attrs = *cache_attrs;
+	if (kobject_init_and_add(&cache_obj->kobj, &cache_ktype, node->cache_kobj,
+				 "index%d", cache_attrs->level)) {
+		dev_warn(&node->dev, "failed to add cache level:%d\n",
+			 cache_attrs->level);
+		/* the ktype release callback frees cache_obj */
+		kobject_put(&cache_obj->kobj);
+		return;
+	}
+	list_add_tail(&cache_obj->node, &node->cache_attrs);
+}
+
+static void node_remove_caches(struct node *node)
+{
+	struct node_cache_obj *obj, *next;
+
+	if (!node->cache_kobj)
+		return;
+
+	list_for_each_entry_safe(obj, next, &node->cache_attrs, node) {
+		list_del(&obj->node);
+		/* the final put releases and frees obj via the ktype */
+		kobject_put(&obj->kobj);
+	}
+	kobject_put(node->cache_kobj);
+}
+
+static void node_init_caches(unsigned int nid)
+{
+	INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
+}
+#else
+static void node_init_caches(unsigned int nid) { }
+static void node_remove_caches(struct node *node) { }
 #endif
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
@@ -345,6 +460,7 @@ static void node_device_release(struct device *dev)
 	 */
 	flush_work(&node->node_work);
 #endif
+	node_remove_caches(node);
 	kfree(node);
 }
 
@@ -658,6 +774,7 @@ int __register_one_node(int nid)
 
 	/* initialize work queue for memory hot plug */
 	init_node_hugetlb_work(nid);
+	node_init_caches(nid);
 
 	return error;
 }
diff --git a/include/linux/node.h b/include/linux/node.h
index 6a1aa6a153f8..f499a17f84bc 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -36,6 +36,27 @@ struct node_hmem_attrs {
 	unsigned int write_latency;
 };
 void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs);
+
+enum cache_associativity {
+	NODE_CACHE_DIRECT_MAP,
+	NODE_CACHE_INDEXED,
+	NODE_CACHE_OTHER,
+};
+
+enum cache_write_policy {
+	NODE_CACHE_WRITE_BACK,
+	NODE_CACHE_WRITE_THROUGH,
+	NODE_CACHE_WRITE_OTHER,
+};
+
+struct node_cache_attrs {
+	enum cache_associativity associativity;
+	enum cache_write_policy write_policy;
+	u64 size;
+	u16 line_size;
+	u8  level;
+};
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
 #endif
 
 struct node {
@@ -46,6 +67,8 @@ struct node {
 #endif
 #ifdef CONFIG_HMEM
 	struct node_hmem_attrs hmem_attrs;
+	struct list_head cache_attrs;
+	struct kobject *cache_kobj;
 #endif
 };
 
-- 
2.14.4



* [PATCH 5/7] doc/vm: New documentation for memory cache
  2018-11-14 22:49 [PATCH 1/7] node: Link memory nodes to their compute nodes Keith Busch
                   ` (2 preceding siblings ...)
  2018-11-14 22:49 ` [PATCH 4/7] node: Add memory caching attributes Keith Busch
@ 2018-11-14 22:49 ` Keith Busch
  2018-11-15  0:41   ` Dave Hansen
                     ` (2 more replies)
  2018-11-14 22:49 ` [PATCH 6/7] acpi: Create subtable parsing infrastructure Keith Busch
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-14 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Platforms may provide system memory that contains memory-side caches to
help speed up access. These memory caches are part of a memory node and
the cache attributes are exported by the kernel.

Add new documentation providing a brief overview of system memory side
caches and the kernel provided attributes for application optimization.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/vm/numacache.rst | 76 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)
 create mode 100644 Documentation/vm/numacache.rst

diff --git a/Documentation/vm/numacache.rst b/Documentation/vm/numacache.rst
new file mode 100644
index 000000000000..e79c801b7e3b
--- /dev/null
+++ b/Documentation/vm/numacache.rst
@@ -0,0 +1,76 @@
+.. _numacache:
+
+==========
+NUMA Cache
+==========
+
+System memory may be constructed in a hierarchy of elements with various
+performance characteristics in order to provide a large address space of
+slower performing memory cached by smaller, higher performing
+memory. The system physical addresses that software is aware of are
+provided by the last memory level in the hierarchy, while higher
+performing memory transparently provides caching to slower levels.
+
+The term "far memory" is used to denote the last level memory in the
+hierarchy. Each increasing cache level provides higher performing CPU
+access, and the term "near memory" represents the highest level cache
+provided by the system. This numbering differs from CPU caches, where
+the cache level (ex: L1, L2, L3) uses a CPU centric view with each level
+being lower performing and closer to system memory. The memory cache
+level is relative to the last level memory, so the higher numbered cache
+level denotes memory nearer to the CPU, and further from far memory.
+
+The memory side caches are not directly addressable by software. When
+software accesses a system address, the system will return the data from
+the near memory cache if it is present. If it is not present, the system
+accesses the next level of memory until there is either a hit in that
+cache level, or it reaches far memory.
+
+In order to maximize the performance of such a setup, software may
+wish to query the memory cache attributes. If the system provides a way
+to query this information, for example with ACPI HMAT (Heterogeneous
+Memory Attribute Table)[1], the kernel will append these attributes to
+the NUMA node that provides the memory.
+
+When the kernel first registers a memory cache with a node, the kernel
+will create the following directory::
+
+	/sys/devices/system/node/nodeX/cache/
+
+If that directory is not present, then either the memory does not have
+a side cache, or that information is not provided to the kernel.
+
+The attributes for each level of cache are provided under its cache
+level index::
+
+	/sys/devices/system/node/nodeX/cache/indexA/
+	/sys/devices/system/node/nodeX/cache/indexB/
+	/sys/devices/system/node/nodeX/cache/indexC/
+
+Each cache level's directory provides its attributes. For example,
+the following is a single cache level and the attributes available for
+software to query::
+
+	# tree /sys/devices/system/node/node0/cache/
+	/sys/devices/system/node/node0/cache/
+	|-- index1
+	|   |-- associativity
+	|   |-- level
+	|   |-- line_size
+	|   |-- size
+	|   `-- write_policy
+
+The cache "associativity" will be 0 if it is a direct-mapped cache, and
+non-zero for any other index-based, multi-way associativity.
+
+The "level" is the distance from the far memory, and matches the number
+appended to its "index" directory.
+
+The "line_size" is the number of bytes accessed on a cache miss.
+
+The "size" is the number of bytes provided by this cache level.
+
+The "write_policy" will be 0 for write-back, and non-zero for
+write-through caching.
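+
+A minimal userspace sketch enumerating the cache levels reported for a
+node; the node number is illustrative only::
+
+	#include <dirent.h>
+	#include <stdio.h>
+	#include <string.h>
+
+	int main(void)
+	{
+		DIR *dir = opendir("/sys/devices/system/node/node0/cache");
+		struct dirent *d;
+
+		if (!dir)
+			return 0; /* no memory side cache reported */
+		while ((d = readdir(dir)))
+			if (!strncmp(d->d_name, "index", 5))
+				printf("cache level directory: %s\n", d->d_name);
+		closedir(dir);
+		return 0;
+	}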
+
+[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
-- 
2.14.4



* [PATCH 6/7] acpi: Create subtable parsing infrastructure
  2018-11-14 22:49 [PATCH 1/7] node: Link memory nodes to their compute nodes Keith Busch
                   ` (3 preceding siblings ...)
  2018-11-14 22:49 ` [PATCH 5/7] doc/vm: New documentation for memory cache Keith Busch
@ 2018-11-14 22:49 ` Keith Busch
  2018-11-19  9:58   ` Rafael J. Wysocki
  2018-11-14 22:49 ` [PATCH 7/7] acpi/hmat: Parse and report heterogeneous memory Keith Busch
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 43+ messages in thread
From: Keith Busch @ 2018-11-14 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

Parsing entries in an ACPI table had assumed a generic header structure
that is most common. There is no standard ACPI subtable header, though,
so less common table types would need custom parsers if they want to
walk their subtable entry list.

Create the infrastructure for adding different table types so parsing
the entries array may be reused for all ACPI system tables.
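
For example, a caller walking the SRAT memory affinity entries continues
to use the existing entry point; a minimal registration sketch, where
the handler name is illustrative:

  static int __init my_parse_mem_affinity(struct acpi_subtable_header *header,
                                          const unsigned long end)
  {
          /* examine one struct acpi_srat_mem_affinity entry */
          return 0;
  }

  static struct acpi_subtable_proc proc = {
          .id      = ACPI_SRAT_TYPE_MEMORY_AFFINITY,
          .handler = my_parse_mem_affinity,
  };

  acpi_table_parse_entries_array(ACPI_SIG_SRAT,
                                 sizeof(struct acpi_table_srat),
                                 &proc, 1, 0);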

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/tables.c | 75 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 65 insertions(+), 10 deletions(-)

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 61203eebf3a1..15ee77780f68 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -49,6 +49,19 @@ static struct acpi_table_desc initial_tables[ACPI_MAX_TABLES] __initdata;
 
 static int acpi_apic_instance __initdata;
 
+enum acpi_subtable_type {
+	ACPI_SUBTABLE_COMMON,
+};
+
+union acpi_subtable_headers {
+	struct acpi_subtable_header common;
+};
+
+struct acpi_subtable_entry {
+	union acpi_subtable_headers *hdr;
+	enum acpi_subtable_type type;
+};
+
 /*
  * Disable table checksum verification for the early stage due to the size
  * limitation of the current x86 early mapping implementation.
@@ -217,6 +230,45 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
 	}
 }
 
+static unsigned long __init
+acpi_get_entry_type(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return entry->hdr->common.type;
+	}
+	WARN_ONCE(1, "invalid acpi type\n");
+	return 0;
+}
+
+static unsigned long __init
+acpi_get_entry_length(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return entry->hdr->common.length;
+	}
+	WARN_ONCE(1, "invalid acpi type\n");
+	return 0;
+}
+
+static unsigned long __init
+acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
+{
+	switch (entry->type) {
+	case ACPI_SUBTABLE_COMMON:
+		return sizeof(entry->hdr->common);
+	}
+	WARN_ONCE(1, "invalid acpi type\n");
+	return 0;
+}
+
+static enum acpi_subtable_type __init
+acpi_get_subtable_type(char *id)
+{
+	return ACPI_SUBTABLE_COMMON;
+}
+
 /**
  * acpi_parse_entries_array - for each proc_num find a suitable subtable
  *
@@ -246,8 +298,8 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 		struct acpi_subtable_proc *proc, int proc_num,
 		unsigned int max_entries)
 {
-	struct acpi_subtable_header *entry;
-	unsigned long table_end;
+	struct acpi_subtable_entry entry;
+	unsigned long table_end, subtable_len, entry_len;
 	int count = 0;
 	int errs = 0;
 	int i;
@@ -270,19 +322,21 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 
 	/* Parse all entries looking for a match. */
 
-	entry = (struct acpi_subtable_header *)
+	entry.type = acpi_get_subtable_type(id);
+	entry.hdr = (union acpi_subtable_headers *)
 	    ((unsigned long)table_header + table_size);
+	subtable_len = acpi_get_subtable_header_length(&entry);
 
-	while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
-	       table_end) {
+	while (((unsigned long)entry.hdr) + subtable_len < table_end) {
 		if (max_entries && count >= max_entries)
 			break;
 
 		for (i = 0; i < proc_num; i++) {
-			if (entry->type != proc[i].id)
+			if (acpi_get_entry_type(&entry) != proc[i].id)
 				continue;
 			if (!proc[i].handler ||
-			     (!errs && proc[i].handler(entry, table_end))) {
+			     (!errs && proc[i].handler(&entry.hdr->common,
+						       table_end))) {
 				errs++;
 				continue;
 			}
@@ -297,13 +351,14 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
 		 * If entry->length is 0, break from this loop to avoid
 		 * infinite loop.
 		 */
-		if (entry->length == 0) {
+		entry_len = acpi_get_entry_length(&entry);
+		if (entry_len == 0) {
 			pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id);
 			return -EINVAL;
 		}
 
-		entry = (struct acpi_subtable_header *)
-		    ((unsigned long)entry + entry->length);
+		entry.hdr = (union acpi_subtable_headers *)
+		    ((unsigned long)entry.hdr + entry_len);
 	}
 
 	if (max_entries && count > max_entries) {
-- 
2.14.4



* [PATCH 7/7] acpi/hmat: Parse and report heterogeneous memory
  2018-11-14 22:49 [PATCH 1/7] node: Link memory nodes to their compute nodes Keith Busch
                   ` (4 preceding siblings ...)
  2018-11-14 22:49 ` [PATCH 6/7] acpi: Create subtable parsing infrastructure Keith Busch
@ 2018-11-14 22:49 ` Keith Busch
  2018-11-15 13:57 ` [PATCH 1/7] node: Link memory nodes to their compute nodes Matthew Wilcox
  2018-11-19  2:46 ` Anshuman Khandual
  7 siblings, 0 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-14 22:49 UTC (permalink / raw)
  To: linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams,
	Keith Busch

The ACPI Heterogeneous Memory Attribute Table (HMAT), added in ACPI 6.2,
provides a way for platforms to express different memory address range
characteristics.

Parse these tables provided by the platform and register the local
initiator performance access and caching attributes with the memory NUMA
nodes so they can be observed through sysfs.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/acpi/Kconfig  |   9 ++
 drivers/acpi/Makefile |   1 +
 drivers/acpi/hmat.c   | 384 ++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/acpi/tables.c |  10 ++
 4 files changed, 404 insertions(+)
 create mode 100644 drivers/acpi/hmat.c

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 7cea769c37df..9e6b767dd366 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -327,6 +327,15 @@ config ACPI_NUMA
 	depends on (X86 || IA64 || ARM64)
 	default y if IA64_GENERIC || IA64_SGI_SN2 || ARM64
 
+config ACPI_HMAT
+	bool "ACPI Heterogeneous Memory Attribute Table Support"
+	depends on ACPI_NUMA
+	select HMEM
+	help
+	 Parses representation of the ACPI Heterogeneous Memory Attribute
+	 Table (HMAT) and sets the memory node relationships and access
+	 attributes.
+
 config ACPI_CUSTOM_DSDT_FILE
 	string "Custom DSDT Table file to include"
 	default ""
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index edc039313cd6..b5e13499f88b 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -55,6 +55,7 @@ acpi-$(CONFIG_X86)		+= x86/apple.o
 acpi-$(CONFIG_X86)		+= x86/utils.o
 acpi-$(CONFIG_DEBUG_FS)		+= debugfs.o
 acpi-$(CONFIG_ACPI_NUMA)	+= numa.o
+acpi-$(CONFIG_ACPI_HMAT)	+= hmat.o
 acpi-$(CONFIG_ACPI_PROCFS_POWER) += cm_sbs.o
 acpi-y				+= acpi_lpat.o
 acpi-$(CONFIG_ACPI_LPIT)	+= acpi_lpit.o
diff --git a/drivers/acpi/hmat.c b/drivers/acpi/hmat.c
new file mode 100644
index 000000000000..9070e84fd30e
--- /dev/null
+++ b/drivers/acpi/hmat.c
@@ -0,0 +1,384 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Heterogeneous Memory Attributes Table (HMAT) representation
+ *
+ * Copyright (c) 2018, Intel Corporation.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/bitops.h>
+#include <linux/cpu.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/node.h>
+#include <linux/slab.h>
+#include <linux/sysfs.h>
+
+static LIST_HEAD(targets);
+
+struct memory_target {
+	struct list_head node;
+	unsigned int memory_pxm;
+	unsigned long processor_nodes[BITS_TO_LONGS(MAX_NUMNODES)];
+	struct node_hmem_attrs hmem;
+};
+
+static __init struct memory_target *find_mem_target(unsigned int m)
+{
+	struct memory_target *t;
+
+	list_for_each_entry(t, &targets, node)
+		if (t->memory_pxm == m)
+			return t;
+	return NULL;
+}
+
+static __init void alloc_memory_target(unsigned int mem_pxm)
+{
+	struct memory_target *t;
+
+	if (pxm_to_node(mem_pxm) == NUMA_NO_NODE)
+		return;
+
+	t = find_mem_target(mem_pxm);
+	if (t)
+		return;
+
+	t = kzalloc(sizeof(*t), GFP_KERNEL);
+	if (!t)
+		return;
+
+	t->memory_pxm = mem_pxm;
+	list_add_tail(&t->node, &targets);
+}
+
+static __init const char *hmat_data_type(u8 type)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+		return "Access Latency";
+	case ACPI_HMAT_READ_LATENCY:
+		return "Read Latency";
+	case ACPI_HMAT_WRITE_LATENCY:
+		return "Write Latency";
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+		return "Access Bandwidth";
+	case ACPI_HMAT_READ_BANDWIDTH:
+		return "Read Bandwidth";
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		return "Write Bandwidth";
+	default:
+		return "Reserved";
+	}
+}
+
+static __init const char *hmat_data_type_suffix(u8 type)
+{
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+	case ACPI_HMAT_READ_LATENCY:
+	case ACPI_HMAT_WRITE_LATENCY:
+		return " nsec";
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+	case ACPI_HMAT_READ_BANDWIDTH:
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		return " MB/s";
+	default:
+		return "";
+	}
+}
+
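+/*
+ * Track the best value seen so far for a target attribute: bandwidth
+ * prefers larger values (gt == true), latency prefers smaller. Report
+ * whether the new value replaced or tied the current best.
+ */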
+static __init void hmat_update_value(u32 *dest, u32 value, bool gt,
+				     bool *update, bool *tie)
+{
+	if (!(*dest)) {
+		*update = true;
+		*dest = value;
+	} else if (gt && value > *dest) {
+		*update = true;
+		*dest = value;
+	} else if (!gt && value < *dest) {
+		*update = true;
+		*dest = value;
+	} else if (*dest == value) {
+		*tie = true;
+	}
+}
+
+static __init void hmat_update_target_access(unsigned int p, unsigned int m, u8 type,
+				      unsigned int value)
+{
+	struct memory_target *t = find_mem_target(m);
+	int p_node = pxm_to_node(p);
+	bool update = false, tie = false;
+
+	if (!t || p_node == NUMA_NO_NODE || !value)
+		return;
+
+	switch (type) {
+	case ACPI_HMAT_ACCESS_LATENCY:
+		hmat_update_value(&t->hmem.read_latency, value, false,
+				  &update, &tie);
+		hmat_update_value(&t->hmem.write_latency, value, false,
+				  &update, &tie);
+		break;
+	case ACPI_HMAT_READ_LATENCY:
+		hmat_update_value(&t->hmem.read_latency, value, false,
+				  &update, &tie);
+		break;
+	case ACPI_HMAT_WRITE_LATENCY:
+		hmat_update_value(&t->hmem.write_latency, value, false,
+				  &update, &tie);
+		break;
+	case ACPI_HMAT_ACCESS_BANDWIDTH:
+		hmat_update_value(&t->hmem.read_bandwidth, value, true,
+				  &update, &tie);
+		hmat_update_value(&t->hmem.write_bandwidth, value, true,
+				  &update, &tie);
+		break;
+	case ACPI_HMAT_READ_BANDWIDTH:
+		hmat_update_value(&t->hmem.read_bandwidth, value, true,
+				  &update, &tie);
+		break;
+	case ACPI_HMAT_WRITE_BANDWIDTH:
+		hmat_update_value(&t->hmem.write_bandwidth, value, true,
+				  &update, &tie);
+		break;
+	}
+
+	if (update) {
+		bitmap_zero(t->processor_nodes, MAX_NUMNODES);
+		set_bit(p_node, t->processor_nodes);
+	} else if (tie) {
+		set_bit(p_node, t->processor_nodes);
+	}
+}
+
+static __init int hmat_parse_locality(struct acpi_subtable_header *header,
+				      const unsigned long end)
+{
+	struct acpi_hmat_locality *loc = (void *)header;
+	unsigned int i, j, total_size, ipds, tpds;
+	u32 *inits, *targs, value;
+	u16 *entries;
+	u8 type;
+
+	if (loc->header.length < sizeof(*loc)) {
+		pr_err("HMAT: Unexpected locality header length: %d\n",
+			loc->header.length);
+		return -EINVAL;
+	}
+
+	type = loc->data_type;
+	ipds = loc->number_of_initiator_Pds;
+	tpds = loc->number_of_target_Pds;
+	total_size = sizeof(*loc) + sizeof(*entries) * ipds * tpds +
+		     sizeof(*inits) * ipds + sizeof(*targs) * tpds;
+	if (loc->header.length < total_size) {
+		pr_err("HMAT: Unexpected locality header length:%d, minimum required:%d\n",
+			loc->header.length, total_size);
+		return -EINVAL;
+	}
+
+	pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n",
+		loc->flags, hmat_data_type(type), ipds, tpds,
+		loc->entry_base_unit);
+
+	inits = (u32 *)(loc + 1);
+	targs = &inits[ipds];
+	entries = (u16 *)(&targs[tpds]);
+	for (i = 0; i < ipds; i++) {
+		for (j = 0; j < tpds; j++) {
+			value = (*entries * loc->entry_base_unit) / 10;
+			pr_info("  Initiator-Target[%d-%d]:%d%s\n", inits[i],
+				targs[j], value, hmat_data_type_suffix(type));
+
+			if ((loc->flags & ACPI_HMAT_MEMORY_HIERARCHY) ==
+							ACPI_HMAT_MEMORY)
+				hmat_update_target_access(inits[i], targs[j],
+							  type, value);
+			entries++;
+		}
+	}
+	return 0;
+}
+
+static __init int hmat_parse_cache(struct acpi_subtable_header *header,
+				   const unsigned long end)
+{
+	struct acpi_hmat_cache *cache = (void *)header;
+	struct node_cache_attrs cache_attrs;
+	u32 attrs;
+
+	if (cache->header.length < sizeof(*cache)) {
+		pr_err("HMAT: Unexpected cache header length: %d\n",
+			cache->header.length);
+		return -EINVAL;
+	}
+
+	attrs = cache->cache_attributes;
+	pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n",
+		cache->memory_PD, cache->cache_size, attrs,
+		cache->number_of_SMBIOShandles);
+
+	cache_attrs.size = cache->cache_size;
+	cache_attrs.level = (attrs & ACPI_HMAT_CACHE_LEVEL) >> 4;
+	cache_attrs.line_size = (attrs & ACPI_HMAT_CACHE_LINE_SIZE) >> 16;
+
+	switch ((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) {
+	case ACPI_HMAT_CA_DIRECT_MAPPED:
+		cache_attrs.associativity = NODE_CACHE_DIRECT_MAP;
+		break;
+	case ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING:
+		cache_attrs.associativity = NODE_CACHE_INDEXED;
+		break;
+	case ACPI_HMAT_CA_NONE:
+	default:
+		cache_attrs.associativity = NODE_CACHE_OTHER;
+		break;
+	}
+	switch ((attrs & ACPI_HMAT_WRITE_POLICY) >> 12) {
+	case ACPI_HMAT_CP_WB:
+		cache_attrs.write_policy = NODE_CACHE_WRITE_BACK;
+		break;
+	case ACPI_HMAT_CP_WT:
+		cache_attrs.write_policy = NODE_CACHE_WRITE_THROUGH;
+		break;
+	case ACPI_HMAT_CP_NONE:
+	default:
+		cache_attrs.write_policy = NODE_CACHE_WRITE_OTHER;
+		break;
+	}
+
+	node_add_cache(pxm_to_node(cache->memory_PD), &cache_attrs);
+	return 0;
+}
+
+static int __init hmat_parse_address_range(struct acpi_subtable_header *header,
+					   const unsigned long end)
+{
+	struct acpi_hmat_address_range *spa = (void *)header;
+
+	if (spa->header.length != sizeof(*spa)) {
+		pr_err("HMAT: Unexpected address range header length: %d\n",
+			spa->header.length);
+		return -EINVAL;
+	}
+	pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
+		spa->physical_address_base, spa->physical_address_length,
+		spa->flags, spa->processor_PD, spa->memory_PD);
+	return 0;
+}
+
+static int __init hmat_parse_subtable(struct acpi_subtable_header *header,
+				      const unsigned long end)
+{
+	struct acpi_hmat_structure *hdr = (void *)header;
+
+	if (!hdr)
+		return -EINVAL;
+
+	switch (hdr->type) {
+	case ACPI_HMAT_TYPE_ADDRESS_RANGE:
+		return hmat_parse_address_range(header, end);
+	case ACPI_HMAT_TYPE_LOCALITY:
+		return hmat_parse_locality(header, end);
+	case ACPI_HMAT_TYPE_CACHE:
+		return hmat_parse_cache(header, end);
+	default:
+		return -EINVAL;
+	}
+}
+
+static __init int srat_parse_mem_affinity(struct acpi_subtable_header *header,
+					  const unsigned long end)
+{
+	struct acpi_srat_mem_affinity *ma = (void *)header;
+
+	if (!ma)
+		return -EINVAL;
+	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
+		return 0;
+	alloc_memory_target(ma->proximity_domain);
+	return 0;
+}
+
+static __init void hmat_register_targets(void)
+{
+	struct memory_target *t, *next;
+	unsigned int m, p;
+
+	list_for_each_entry_safe(t, next, &targets, node) {
+		bool hmem_valid = false;
+
+		list_del(&t->node);
+		m = pxm_to_node(t->memory_pxm);
+
+		for_each_set_bit(p, t->processor_nodes, MAX_NUMNODES) {
+			if (register_memory_node_under_compute_node(m, p))
+				goto free_target;
+			hmem_valid = true;
+		}
+		if (hmem_valid)
+			node_set_perf_attrs(m, &t->hmem);
+free_target:
+		kfree(t);
+	}
+}
+
+static __init int parse_noop(struct acpi_table_header *table)
+{
+	return 0;
+}
+
+static __init int hmat_init(void)
+{
+	struct acpi_subtable_proc subtable_proc;
+	struct acpi_table_header *tbl;
+	enum acpi_hmat_type i;
+	acpi_status status;
+
+	if (srat_disabled())
+		return 0;
+
+	status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
+	if (ACPI_FAILURE(status))
+		return 0;
+
+	if (acpi_table_parse(ACPI_SIG_SRAT, parse_noop))
+		goto out_put;
+
+	memset(&subtable_proc, 0, sizeof(subtable_proc));
+	subtable_proc.id = ACPI_SRAT_TYPE_MEMORY_AFFINITY;
+	subtable_proc.handler = srat_parse_mem_affinity;
+
+	if (acpi_table_parse_entries_array(ACPI_SIG_SRAT,
+				sizeof(struct acpi_table_srat),
+				&subtable_proc, 1, 0) < 0)
+		goto out_put;
+	acpi_put_table(tbl);
+
+	status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
+	if (ACPI_FAILURE(status))
+		return 0;
+
+	if (acpi_table_parse(ACPI_SIG_HMAT, parse_noop))
+		goto out_put;
+
+	memset(&subtable_proc, 0, sizeof(subtable_proc));
+	subtable_proc.handler = hmat_parse_subtable;
+	for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) {
+		subtable_proc.id = i;
+		if (acpi_table_parse_entries_array(ACPI_SIG_HMAT,
+					sizeof(struct acpi_table_hmat),
+					&subtable_proc, 1, 0) < 0)
+			goto out_put;
+	}
+	hmat_register_targets();
+ out_put:
+	acpi_put_table(tbl);
+	return 0;
+}
+device_initcall(hmat_init);
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 15ee77780f68..a34669a1d47a 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -51,10 +51,12 @@ static int acpi_apic_instance __initdata;
 
 enum acpi_subtable_type {
 	ACPI_SUBTABLE_COMMON,
+	ACPI_SUBTABLE_HMAT,
 };
 
 union acpi_subtable_headers {
 	struct acpi_subtable_header common;
+	struct acpi_hmat_structure hmat;
 };
 
 struct acpi_subtable_entry {
@@ -236,6 +238,8 @@ acpi_get_entry_type(struct acpi_subtable_entry *entry)
 	switch (entry->type) {
 	case ACPI_SUBTABLE_COMMON:
 		return entry->hdr->common.type;
+	case ACPI_SUBTABLE_HMAT:
+		return entry->hdr->hmat.type;
 	}
 	WARN_ONCE(1, "invalid acpi type\n");
 	return 0;
@@ -247,6 +251,8 @@ acpi_get_entry_length(struct acpi_subtable_entry *entry)
 	switch (entry->type) {
 	case ACPI_SUBTABLE_COMMON:
 		return entry->hdr->common.length;
+	case ACPI_SUBTABLE_HMAT:
+		return entry->hdr->hmat.length;
 	}
 	WARN_ONCE(1, "invalid acpi type\n");
 	return 0;
@@ -258,6 +264,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
 	switch (entry->type) {
 	case ACPI_SUBTABLE_COMMON:
 		return sizeof(entry->hdr->common);
+	case ACPI_SUBTABLE_HMAT:
+		return sizeof(entry->hdr->hmat);
 	}
 	WARN_ONCE(1, "invalid acpi type\n");
 	return 0;
@@ -266,6 +274,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
 static enum acpi_subtable_type __init
 acpi_get_subtable_type(char *id)
 {
+	if (strncmp(id, ACPI_SIG_HMAT, 4) == 0)
+		return ACPI_SUBTABLE_HMAT;
 	return ACPI_SUBTABLE_COMMON;
 }
 
-- 
2.14.4



* Re: [PATCH 4/7] node: Add memory caching attributes
  2018-11-14 22:49 ` [PATCH 4/7] node: Add memory caching attributes Keith Busch
@ 2018-11-15  0:40   ` Dave Hansen
  2018-11-19  4:14   ` Anshuman Khandual
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2018-11-15  0:40 UTC (permalink / raw)
  To: Keith Busch, linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dan Williams

On 11/14/18 2:49 PM, Keith Busch wrote:
> System memory may have side caches to help improve access speed. While
> the system provided cache is transparent to the software accessing
> these memory ranges, applications can optimize their own access based
> on cache attributes.
> 
> In preparation for such systems, provide a new API for the kernel to
> register these memory side caches under the memory node that provides it.
> 
> The kernel's sysfs representation is modeled from the cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> cacheinfo, though, a higher node's memory cache level is nearer to the
> CPU, while lower levels are closer to the backing memory. Also unlike
> CPU cache, the system handles flushing any dirty cached memory to the
> last level the memory on a power failure if the range is persistent.
> 
> The exported attributes are the cache size, the line size, associativity,
> and write back policy.

Could you also include an example of the layout?


* Re: [PATCH 5/7] doc/vm: New documentation for memory cache
  2018-11-14 22:49 ` [PATCH 5/7] doc/vm: New documentation for memory cache Keith Busch
@ 2018-11-15  0:41   ` Dave Hansen
  2018-11-15 13:16   ` Jonathan Cameron
  2018-11-20 13:53   ` Mike Rapoport
  2 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2018-11-15  0:41 UTC (permalink / raw)
  To: Keith Busch, linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dan Williams

On 11/14/18 2:49 PM, Keith Busch wrote:
> +	# tree sys/devices/system/node/node0/cache/
> +	/sys/devices/system/node/node0/cache/
> +	|-- index1
> +	|   |-- associativity
> +	|   |-- level
> +	|   |-- line_size
> +	|   |-- size
> +	|   `-- write_policy

Whoops, and here it is...




* Re: [PATCH 3/7] doc/vm: New documentation for memory performance
  2018-11-14 22:49 ` [PATCH 3/7] doc/vm: New documentation for " Keith Busch
@ 2018-11-15 12:59   ` Jonathan Cameron
  2018-12-10 16:12     ` Dan Williams
  2018-11-20 13:51   ` Mike Rapoport
  1 sibling, 1 reply; 43+ messages in thread
From: Jonathan Cameron @ 2018-11-15 12:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Wed, 14 Nov 2018 15:49:16 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Platforms may provide system memory where some physical address ranges
> perform differently than others. These heterogeneous memory attributes are
> common to the node that provides the memory and exported by the kernel.
> 
> Add new documentation providing a brief overview of such systems and
> the attributes the kernel makes available to aid applications wishing
> to query this information.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
Hi Keith,

Good to see another attempt at this, particularly thinking about simplifying
what is provided to make it easier to use.

I need to have a bit of a think about how this maps onto more complex
topologies, but some initial comments / questions in the meantime.

> ---
>  Documentation/vm/numaperf.rst | 71 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 71 insertions(+)
>  create mode 100644 Documentation/vm/numaperf.rst
> 
> diff --git a/Documentation/vm/numaperf.rst b/Documentation/vm/numaperf.rst
> new file mode 100644
> index 000000000000..5a3ecaff5474
> --- /dev/null
> +++ b/Documentation/vm/numaperf.rst
> @@ -0,0 +1,71 @@
> +.. _numaperf:
> +
> +================
> +NUMA Performance
> +================
> +
> +Some platforms may have multiple types of memory attached to a single
> +CPU. These disparate memory ranges share some characteristics, such as
> +CPU cache coherence, but may have different performance. For example,
> +different media types and buses affect bandwidth and latency.
> +
> +A system supporting such heterogeneous memory groups each memory type
> +under different "nodes" based on similar CPU locality and performance
> +characteristics.

I think this statement should be more specific.  The requirement is that
it should have similar CPU locality and performance characteristics with
respect to every initiator in the system, not just the local one.

>  Some memory may share the same node as a CPU, and
> +others are provided as memory-only nodes. While memory only nodes do not
> +provide CPUs, they may still be local to one or more compute nodes. The
> +following diagram shows one such example of two compute noes with local
> +memory and a memory only node for each of compute node:
> +
> + +------------------+     +------------------+
> + | Compute Node 0   +-----+ Compute Node 1   |
> + | Local Node0 Mem  |     | Local Node1 Mem  |
> + +--------+---------+     +--------+---------+
> +          |                        |
> + +--------+---------+     +--------+---------+
> + | Slower Node2 Mem |     | Slower Node3 Mem |
> + +------------------+     +--------+---------+
> +
> +A "memory initiator" is a node containing one or more devices such as
> +CPUs or separate memory I/O devices that can initiate memory requests. A
> +"memory target" is a node containing one or more CPU-accessible physical
> +address ranges.
> +
> +When multiple memory initiators exist, accessing the same memory
> +target may not perform the same as each other. 

When multiple initiators exist, they may not all show the same performance
when accessing a given memory target.

> The highest performing
> +initiator to a given target is considered to be one of that target's
> +local initiators.

One of, or the only?   Are we allowing a many to one mapping if several
initiators have the same performance but are in different nodes?

Also, what is your measure of performance, latency or bandwidth or some
combination of the two?

> +
> +To aid applications matching memory targets with their initiators,
> +the kernel provide symlinks to each other like the following example::
> +
> +	# ls -l /sys/devices/system/node/nodeX/initiator*
> +	/sys/devices/system/node/nodeX/targetY -> ../nodeY
ls on initiator* is giving targetY?

> +
> +	# ls -l /sys/devices/system/node/nodeY/target*
> +	/sys/devices/system/node/nodeY/initiatorX -> ../nodeX
> +

Just to check as I'm not clear, do we have self links when the 
memory and initiators are in the same node?

> +Applications may wish to consider which node they want their memory to
> +be allocated from based on the nodes performance characteristics. If
> +the system provides these attributes, the kernel exports them under the
> +node sysfs hierarchy by appending the initiator_access directory under
> +the node as follows::
> +
> +	/sys/devices/system/node/nodeY/initiator_access/
> +
> +The kernel does not provide performance attributes for non-local memory
> +initiators. The performance characteristics the kernel provides for
> +the local initiators are exported are as follows::
> +
> +	# tree /sys/devices/system/node/nodeY/initiator_access
> +	/sys/devices/system/node/nodeY/initiator_access
> +	|-- read_bandwidth
> +	|-- read_latency
> +	|-- write_bandwidth
> +	`-- write_latency
> +
> +The bandwidth attributes are provided in MiB/second.
> +
> +The latency attributes are provided in nanoseconds.
> +
> +See also: https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

My worry here is we are explicitly making an interface that is only ever
providing "local" node information, where local node is not the best
defined thing in the world for complex topologies.

I have no problem with that making a sensible starting point for providing
information userspace knows what to do with, just with an interface that
in of itself doesn't make that clear.

Perhaps something as simple as
/sys/devices/system/nodeY/local_initiatorX
/sys/devices/system/nodeX/local_targetY

That leaves us the option of coming along later and having a full listing
when a userspace requirement has become clear.  Another option would
be an exhaustive list of all initiator / memory pairs that exist, with
an additional sysfs file giving a list of those that are nearest
to avoid every userspace program having to do the search.

Thanks,

Jonathan




* Re: [PATCH 5/7] doc/vm: New documentation for memory cache
  2018-11-14 22:49 ` [PATCH 5/7] doc/vm: New documentation for memory cache Keith Busch
  2018-11-15  0:41   ` Dave Hansen
@ 2018-11-15 13:16   ` Jonathan Cameron
  2018-11-20 13:53   ` Mike Rapoport
  2 siblings, 0 replies; 43+ messages in thread
From: Jonathan Cameron @ 2018-11-15 13:16 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Wed, 14 Nov 2018 15:49:18 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Platforms may provide system memory that contains side caches to help

If we can call them "memory-side caches" that would avoid a persistent
confusion on what they actually are.  It took me ages to get to the
bottom of why they were always drawn to the side of the memory
path ;)

> spped up access. These memory caches are part of a memory node and

speed

> the cache attributes are exported by the kernel.
> 
> Add new documentation providing a brief overview of system memory side
> caches and the kernel provided attributes for application optimization.
A few few nits in line, but mostly looks good to me.

Thanks,

Jonathan

> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/vm/numacache.rst | 76 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 76 insertions(+)
>  create mode 100644 Documentation/vm/numacache.rst
> 
> diff --git a/Documentation/vm/numacache.rst b/Documentation/vm/numacache.rst
> new file mode 100644
> index 000000000000..e79c801b7e3b
> --- /dev/null
> +++ b/Documentation/vm/numacache.rst
> @@ -0,0 +1,76 @@
> +.. _numacache:
> +
> +==========
> +NUMA Cache
> +==========
> +
> +System memory may be constructed in a hierarchy of various performing

of elements with various performance

> +characteristics in order to provide large address space of slower
> +performing memory cached by a smaller size of higher performing

cached by smaller higher performing memory.

> +memory. The system physical addresses that software is aware of see

is aware of is provided (no 'see')

> +is provided by the last memory level in the hierarchy, while higher
> +performing memory transparently provides caching to slower levels.
> +
> +The term "far memory" is used to denote the last level memory in the
> +hierarchy. Each increasing cache level provides higher performing CPU

initiator rather than CPU?

> +access, and the term "near memory" represents the highest level cache
> +provided by the system. This number is different than CPU caches where
> +the cache level (ex: L1, L2, L3) uses a CPU centric view with each level
> +being lower performing and closer to system memory. The memory cache
> +level is centric to the last level memory, so the higher numbered cache

from the last level memory?

> +level denotes memory nearer to the CPU, and further from far memory.
> +
> +The memory side caches are not directly addressable by software. When
> +software accesses a system address, the system will return it from the
> +near memory cache if it is present. If it is not present, the system
> +accesses the next level of memory until there is either a hit in that
> +cache level, or it reaches far memory.
> +
> +In order to maximize the performance out of such a setup, software may
> +wish to query the memory cache attributes. If the system provides a way
> +to query this information, for example with ACPI HMAT (Heterogeneous
> +Memory Attribute Table)[1], the kernel will append these attributes to
> +the NUMA node that provides the memory.
> +
> +When the kernel first registers a memory cache with a node, the kernel
> +will create the following directory::
> +
> +	/sys/devices/system/node/nodeX/cache/

Given we have other things with caches in a numa node, should we make
this name more specific?

> +
> +If that directory is not present, then either the memory does not have
> +a side cache, or that information is not provided to the kernel.
> +
> +The attributes for each level of cache are provided under its cache
> +level index::
> +
> +	/sys/devices/system/node/nodeX/cache/indexA/
> +	/sys/devices/system/node/nodeX/cache/indexB/
> +	/sys/devices/system/node/nodeX/cache/indexC/
> +
> +Each cache level's directory provides its attributes. For example,
> +the following is a single cache level and the attributes available for
> +software to query::
> +
> +	# tree sys/devices/system/node/node0/cache/
> +	/sys/devices/system/node/node0/cache/
> +	|-- index1
> +	|   |-- associativity
> +	|   |-- level
> +	|   |-- line_size
> +	|   |-- size
> +	|   `-- write_policy
> +
> +The cache "associativity" will be 0 if it is a direct-mapped cache, and
> +non-zero for any other indexed based, multi-way associativity.
This description is a little vague.  Right now I think we have three options
from HMAT:
1) no associativity (which I suppose could also be called fully associative?)
2) direct mapped (0 in your case)
3) Complex (who knows!)

So how do you map 1 and 3?
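
For illustration, something like the following would keep all three
distinguishable (ACPI_HMAT_CA_* are the ACPICA constants; the -1/0/1 sysfs
encoding here is invented, not what the patch does):

	/* Sketch only: one possible mapping of the three HMAT values. */
	static int hmat_to_sysfs_associativity(u8 hmat_assoc)
	{
		switch (hmat_assoc) {
		case ACPI_HMAT_CA_DIRECT_MAPPED:
			return 0;	/* direct mapped, as proposed */
		case ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING:
			return 1;	/* other indexed, multi-way */
		case ACPI_HMAT_CA_NONE:
		default:
			return -1;	/* firmware reported nothing */
		}
	}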

> +
> +The "level" is the distance from the far memory, and matches the number
> +appended to its "index" directory.
> +
> +The "line_size" is the number of bytes accessed on a cache miss.
> +
> +The "size" is the number of bytes provided by this cache level.
> +
> +The "write_policy" will be 0 for write-back, and non-zero for
> +write-through caching.

Do these not appear if the write_policy provided by ACPI is "none"?

> +
> +[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-14 22:49 [PATCH 1/7] node: Link memory nodes to their compute nodes Keith Busch
                   ` (5 preceding siblings ...)
  2018-11-14 22:49 ` [PATCH 7/7] acpi/hmat: Parse and report heterogeneous memory Keith Busch
@ 2018-11-15 13:57 ` Matthew Wilcox
  2018-11-15 14:59   ` Keith Busch
  2018-11-19  2:46 ` Anshuman Khandual
  7 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2018-11-15 13:57 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> Memory-only nodes will often have affinity to a compute node, and
> platforms have ways to express that locality relationship.
> 
> A node containing CPUs or other DMA devices that can initiate memory
> access are referred to as "memory initiators". A "memory target" is a
> node that provides at least one physical address range accessible to a
> memory initiator.

I think I may be confused here.  If there is _no_ link from node X to
node Y, does that mean that node X's CPUs cannot access the memory on
node Y?  In my mind, all nodes can access all memory in the system,
just not with uniform bandwidth/latency.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-15 13:57 ` [PATCH 1/7] node: Link memory nodes to their compute nodes Matthew Wilcox
@ 2018-11-15 14:59   ` Keith Busch
  2018-11-15 17:50     ` Dan Williams
                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-15 14:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > Memory-only nodes will often have affinity to a compute node, and
> > platforms have ways to express that locality relationship.
> > 
> > A node containing CPUs or other DMA devices that can initiate memory
> > access are referred to as "memory initiators". A "memory target" is a
> > node that provides at least one physical address range accessible to a
> > memory initiator.
> 
> I think I may be confused here.  If there is _no_ link from node X to
> node Y, does that mean that node X's CPUs cannot access the memory on
> node Y?  In my mind, all nodes can access all memory in the system,
> just not with uniform bandwidth/latency.

The link is just about which nodes are "local". It's like how nodes have
a cpulist. Other CPUs not in the node's list can access that node's memory,
but the ones in the mask are local, and provide useful optimization hints.

Would a node mask be preferred to symlinks?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-15 14:59   ` Keith Busch
@ 2018-11-15 17:50     ` Dan Williams
  2018-11-19  3:04       ` Anshuman Khandual
  2018-11-15 20:36     ` Matthew Wilcox
  2018-11-19  2:52     ` Anshuman Khandual
  2 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2018-11-15 17:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matthew Wilcox, Linux Kernel Mailing List, Linux ACPI, Linux MM,
	Greg KH, Rafael J. Wysocki, Dave Hansen

On Thu, Nov 15, 2018 at 7:02 AM Keith Busch <keith.busch@intel.com> wrote:
>
> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > Memory-only nodes will often have affinity to a compute node, and
> > > platforms have ways to express that locality relationship.
> > >
> > > A node containing CPUs or other DMA devices that can initiate memory
> > > access are referred to as "memory initiators". A "memory target" is a
> > > node that provides at least one physical address range accessible to a
> > > memory initiator.
> >
> > I think I may be confused here.  If there is _no_ link from node X to
> > node Y, does that mean that node X's CPUs cannot access the memory on
> > node Y?  In my mind, all nodes can access all memory in the system,
> > just not with uniform bandwidth/latency.
>
> The link is just about which nodes are "local". It's like how nodes have
> a cpulist. Other CPUs not in the node's list can access that node's memory,
> but the ones in the mask are local, and provide useful optimization hints.
>
> Would a node mask be preferred to symlinks?

I think that would be more flexible, because the set of initiators
that may have "best" or "local" access to a target may be more than 1.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-15 14:59   ` Keith Busch
  2018-11-15 17:50     ` Dan Williams
@ 2018-11-15 20:36     ` Matthew Wilcox
  2018-11-16 18:32       ` Keith Busch
  2018-11-16 22:55       ` Dan Williams
  2018-11-19  2:52     ` Anshuman Khandual
  2 siblings, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2018-11-15 20:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > Memory-only nodes will often have affinity to a compute node, and
> > > platforms have ways to express that locality relationship.
> > > 
> > > A node containing CPUs or other DMA devices that can initiate memory
> > > access are referred to as "memory initiators". A "memory target" is a
> > > node that provides at least one physical address range accessible to a
> > > memory initiator.
> > 
> > I think I may be confused here.  If there is _no_ link from node X to
> > node Y, does that mean that node X's CPUs cannot access the memory on
> > node Y?  In my mind, all nodes can access all memory in the system,
> > just not with uniform bandwidth/latency.
> 
> The link is just about which nodes are "local". It's like how nodes have
> > a cpulist. Other CPUs not in the node's list can access that node's memory,
> but the ones in the mask are local, and provide useful optimization hints.

So ... let's imagine a hypothetical system (I've never seen one built like
this, but it doesn't seem too implausible).  Connect four CPU sockets in
a square, each of which has some regular DIMMs attached to it.  CPU A is
0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
Memory D (each CPU only has two "QPI" links).  Then maybe there's some
special memory extender device attached on the PCIe bus.  Now there's
Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
not as local as Memory B is ... and we'd probably _prefer_ to allocate
memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
this seems hard.

I understand you're trying to reflect what the HMAT table is telling you,
I'm just really fuzzy on who's ultimately consuming this information
and what decisions they're trying to drive from it.

> Would a node mask be preferred to symlinks?

I don't have a strong opinion here, but what Dan says makes sense.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-15 20:36     ` Matthew Wilcox
@ 2018-11-16 18:32       ` Keith Busch
  2018-11-19  3:15         ` Anshuman Khandual
  2018-12-04 15:43         ` Aneesh Kumar K.V
  2018-11-16 22:55       ` Dan Williams
  1 sibling, 2 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-16 18:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, Nov 15, 2018 at 12:36:54PM -0800, Matthew Wilcox wrote:
> On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
> > On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > > Memory-only nodes will often have affinity to a compute node, and
> > > > platforms have ways to express that locality relationship.
> > > > 
> > > > A node containing CPUs or other DMA devices that can initiate memory
> > > > access are referred to as "memory initiators". A "memory target" is a
> > > > node that provides at least one physical address range accessible to a
> > > > memory initiator.
> > > 
> > > I think I may be confused here.  If there is _no_ link from node X to
> > > node Y, does that mean that node X's CPUs cannot access the memory on
> > > node Y?  In my mind, all nodes can access all memory in the system,
> > > just not with uniform bandwidth/latency.
> > 
> > The link is just about which nodes are "local". It's like how nodes have
> > a cpulist. Other CPUs not in the node's list can access that node's memory,
> > but the ones in the mask are local, and provide useful optimization hints.
> 
> So ... let's imagine a hypothetical system (I've never seen one built like
> this, but it doesn't seem too implausible).  Connect four CPU sockets in
> a square, each of which has some regular DIMMs attached to it.  CPU A is
> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
> special memory extender device attached on the PCIe bus.  Now there's
> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
> not as local as Memory B is ... and we'd probably _prefer_ to allocate
> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
> this seems hard.

Indeed, that particular example is out of scope for this series. The
first objective is to aid a process running in node B's CPUs to allocate
memory in B1. Anything that crosses QPI is on its own.

> I understand you're trying to reflect what the HMAT table is telling you,
> I'm just really fuzzy on who's ultimately consuming this information
> and what decisions they're trying to drive from it.

Intended consumers include processes using numa_alloc_onnode() and mbind().

Consider a system with faster DRAM and slower persistent memory. Such
a system may group the DRAM in a different proximity domain than the
persistent memory, and both are local to yet another proximity domain
that contains the CPUs. HMAT provides a way to express that relationship,
and this patch provides a user facing abstraction for that information.
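
To make that concrete, a minimal user-space sketch of such a consumer,
assuming libnuma and a target node number already chosen (for example from
the sysfs links in patch 1):

	#include <stddef.h>
	#include <numa.h>	/* numa_available(), numa_alloc_onnode() */

	/* Allocate size bytes preferentially on the chosen target node. */
	static void *alloc_from_target(size_t size, int target_node)
	{
		if (numa_available() < 0)
			return NULL;	/* no NUMA support on this system */
		return numa_alloc_onnode(size, target_node);
	}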

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-15 20:36     ` Matthew Wilcox
  2018-11-16 18:32       ` Keith Busch
@ 2018-11-16 22:55       ` Dan Williams
  1 sibling, 0 replies; 43+ messages in thread
From: Dan Williams @ 2018-11-16 22:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Linux Kernel Mailing List, Linux ACPI, Linux MM,
	Greg KH, Rafael J. Wysocki, Dave Hansen

On Thu, Nov 15, 2018 at 12:37 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
> > On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
> > > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
> > > > Memory-only nodes will often have affinity to a compute node, and
> > > > platforms have ways to express that locality relationship.
> > > >
> > > > A node containing CPUs or other DMA devices that can initiate memory
> > > > access are referred to as "memory initiators". A "memory target" is a
> > > > node that provides at least one physical address range accessible to a
> > > > memory initiator.
> > >
> > > I think I may be confused here.  If there is _no_ link from node X to
> > > node Y, does that mean that node X's CPUs cannot access the memory on
> > > node Y?  In my mind, all nodes can access all memory in the system,
> > > just not with uniform bandwidth/latency.
> >
> > The link is just about which nodes are "local". It's like how nodes have
> > a cpulist. Other CPUs not in the node's list can access that node's memory,
> > but the ones in the mask are local, and provide useful optimization hints.
>
> So ... let's imagine a hypothetical system (I've never seen one built like
> this, but it doesn't seem too implausible).  Connect four CPU sockets in
> a square, each of which has some regular DIMMs attached to it.  CPU A is
> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
> special memory extender device attached on the PCIe bus.  Now there's
> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
> not as local as Memory B is ... and we'd probably _prefer_ to allocate
> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
> this seems hard.
>
> I understand you're trying to reflect what the HMAT table is telling you,
> I'm just really fuzzy on who's ultimately consuming this information
> and what decisions they're trying to drive from it.

The singular "local" is a limitation of the HMAT, but I would expect
the Linux translation of "local" would allow for multiple initiators
that can achieve some semblance of the "best" performance. Anything
less than best is going to have a wide range of variance and will
likely devolve to looking at the platform firmware data table
directly. The expected 80% case is software wants to be able to ask
"which CPUs should I run on to get the best access to this memory?"

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-14 22:49 [PATCH 1/7] node: Link memory nodes to their compute nodes Keith Busch
                   ` (6 preceding siblings ...)
  2018-11-15 13:57 ` [PATCH 1/7] node: Link memory nodes to their compute nodes Matthew Wilcox
@ 2018-11-19  2:46 ` Anshuman Khandual
  7 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2018-11-19  2:46 UTC (permalink / raw)
  To: Keith Busch, linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams



On 11/15/2018 04:19 AM, Keith Busch wrote:
> Memory-only nodes will often have affinity to a compute node, and
> platforms have ways to express that locality relationship.

It may not have local affinity to any compute node, but it might have a
valid NUMA distance from all available compute nodes. This is particularly
true for coherent device memory, which is accessible from all available
compute nodes without having local affinity to any of them, other than the
device's own compute elements, which may or may not be represented as a
NUMA node themselves.

But even for normal system memory, a memory-only node might be far from
the CPU nodes and have no CPUs of its own. In that case there is no local
affinity anyway.

> 
> A node containing CPUs or other DMA devices that can initiate memory
> access are referred to as "memory initiators". A "memory target" is a

Memory initiators should also include heterogeneous compute elements like
GPU cores and FPGA elements, apart from CPUs and DMA engines.

> node that provides at least one physical address range accessible to a
> memory initiator.

This definition of "memory target" makes sense: a node offering coherent
accesses within a PA range from all possible "memory initiators", which
should also include heterogeneous compute elements as mentioned before.

> 
> In preparation for these systems, provide a new kernel API to link
> the target memory node to its initiator compute node with symlinks to
> each other.

Makes sense, but how would we really define NUMA placement for the various
heterogeneous compute elements that are connected to the system bus
differently than the CPUs and DMA engines?

> 
> The following example shows the new sysfs hierarchy setup for memory node
> 'Y' local to compute node 'X':
> 
>   # ls -l /sys/devices/system/node/nodeX/target*
>   /sys/devices/system/node/nodeX/targetY -> ../nodeY
> 
>   # ls -l /sys/devices/system/node/nodeY/initiator*
>   /sys/devices/system/node/nodeY/initiatorX -> ../nodeX

This interlinking makes sense, but only once we are able to define all
possible memory initiators and memory targets as NUMA nodes (which might
not be trivial) taking the heterogeneous compute environment into account.
This linking at least establishes the coherency relationship between memory
initiators and memory targets.

> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/base/node.c  | 32 ++++++++++++++++++++++++++++++++
>  include/linux/node.h |  2 ++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 86d6cd92ce3d..a9b7512a9502 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -372,6 +372,38 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid)
>  				 kobject_name(&node_devices[nid]->dev.kobj));
>  }
>  
> +int register_memory_node_under_compute_node(unsigned int m, unsigned int p)
> +{
> +	int ret;
> +	char initiator[20], target[17];

20, 17 seems arbitrary here.
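
They are presumably the prefix length plus up to ten digits of an int plus
the NUL; deriving them would avoid the magic numbers, e.g. (sketch):

	char initiator[sizeof("initiator") + 10];	/* 10 + 10 = 20 */
	char target[sizeof("target") + 10];		/*  7 + 10 = 17 */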

> +
> +	if (!node_online(p) || !node_online(m))
> +		return -ENODEV;

Just wondering what a NUMA node for a group of GPU compute elements would
look like: one not managed by the kernel but whose elements are still memory
initiators having access to a number of memory targets.

> +	if (m == p)
> +		return 0;

Why skip? Should we not link a memory target to its own node, which can be
its memory initiator as well? The caller of this linking function might
decide whether the memory target is accessible from the same NUMA node as
a memory initiator or not.

> +
> +	snprintf(initiator, sizeof(initiator), "initiator%d", p);
> +	snprintf(target, sizeof(target), "target%d", m);
> +
> +	ret = sysfs_create_link(&node_devices[p]->dev.kobj,
> +				&node_devices[m]->dev.kobj,
> +				target);
> +	if (ret)
> +		return ret;
> +
> +	ret = sysfs_create_link(&node_devices[m]->dev.kobj,
> +				&node_devices[p]->dev.kobj,
> +				initiator);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> + err:
> +	sysfs_remove_link(&node_devices[p]->dev.kobj,
> +			  kobject_name(&node_devices[m]->dev.kobj));
> +	return ret;
> +}
> +
>  int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
>  {
>  	struct device *obj;
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 257bb3d6d014..1fd734a3fb3f 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -75,6 +75,8 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk,
>  extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>  					   unsigned long phys_index);
>  
> +extern int register_memory_node_under_compute_node(unsigned int m, unsigned int p);
> +
>  #ifdef CONFIG_HUGETLBFS
>  extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
>  					 node_registration_func_t unregister);
>
The code is all good, but as mentioned before the primary concern is whether
these semantics will be able to correctly represent all possible present and
future heterogeneous compute environments with multi-attribute memory. This
is going to be a kernel API. So apart from the various NUMA representations
for all possible kinds, the interface has to be abstract, with generic
elements and room for future extension.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-15 14:59   ` Keith Busch
  2018-11-15 17:50     ` Dan Williams
  2018-11-15 20:36     ` Matthew Wilcox
@ 2018-11-19  2:52     ` Anshuman Khandual
  2 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2018-11-19  2:52 UTC (permalink / raw)
  To: Keith Busch, Matthew Wilcox
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams



On 11/15/2018 08:29 PM, Keith Busch wrote:
> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
>> On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
>>> Memory-only nodes will often have affinity to a compute node, and
>>> platforms have ways to express that locality relationship.
>>>
>>> A node containing CPUs or other DMA devices that can initiate memory
>>> access are referred to as "memory initiators". A "memory target" is a
>>> node that provides at least one physical address range accessible to a
>>> memory initiator.
>>
>> I think I may be confused here.  If there is _no_ link from node X to
>> node Y, does that mean that node X's CPUs cannot access the memory on
>> node Y?  In my mind, all nodes can access all memory in the system,
>> just not with uniform bandwidth/latency.
> 
> The link is just about which nodes are "local". It's like how nodes have
> a cpulist. Other CPUs not in the node's list can access that node's memory,
> but the ones in the mask are local, and provide useful optimization hints.
> 
> Would a node mask be preferred to symlinks?

Having a hint for local affinity is definitely a plus, but this must also
provide the coherency matrix to the user, preferably in the form of a
nodemask for each memory target.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-15 17:50     ` Dan Williams
@ 2018-11-19  3:04       ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2018-11-19  3:04 UTC (permalink / raw)
  To: Dan Williams, Keith Busch
  Cc: Matthew Wilcox, Linux Kernel Mailing List, Linux ACPI, Linux MM,
	Greg KH, Rafael J. Wysocki, Dave Hansen



On 11/15/2018 11:20 PM, Dan Williams wrote:
> On Thu, Nov 15, 2018 at 7:02 AM Keith Busch <keith.busch@intel.com> wrote:
>>
>> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
>>> On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
>>>> Memory-only nodes will often have affinity to a compute node, and
>>>> platforms have ways to express that locality relationship.
>>>>
>>>> A node containing CPUs or other DMA devices that can initiate memory
>>>> access are referred to as "memory initiators". A "memory target" is a
>>>> node that provides at least one physical address range accessible to a
>>>> memory initiator.
>>>
>>> I think I may be confused here.  If there is _no_ link from node X to
>>> node Y, does that mean that node X's CPUs cannot access the memory on
>>> node Y?  In my mind, all nodes can access all memory in the system,
>>> just not with uniform bandwidth/latency.
>>
>> The link is just about which nodes are "local". It's like how nodes have
>> a cpulist. Other CPUs not in the node's list can access that node's memory,
>> but the ones in the mask are local, and provide useful optimization hints.
>>
>> Would a node mask be preferred to symlinks?
> 
> I think that would be more flexible, because the set of initiators
> that may have "best" or "local" access to a target may be more than 1.

Right. The memory target should have two nodemasks (for now at least): one
enumerating which initiator nodes can access the memory coherently, and
another for those which are nearer and can benefit from local allocation.
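
A sketch of how the second mask might be exposed from drivers/base/node.c
(the initiator_mask member and the attribute name are invented here, not
part of the posted series):

	/* Sketch: one nodemask attribute per target instead of N symlinks. */
	static ssize_t local_initiators_show(struct device *dev,
					     struct device_attribute *attr,
					     char *buf)
	{
		struct node *node = to_node(dev);

		/* initiator_mask is hypothetical; no such field exists yet */
		return sprintf(buf, "%*pbl\n",
			       nodemask_pr_args(&node->initiator_mask));
	}
	static DEVICE_ATTR_RO(local_initiators);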

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-16 18:32       ` Keith Busch
@ 2018-11-19  3:15         ` Anshuman Khandual
  2018-11-19 15:49           ` Keith Busch
  2018-12-04 15:43         ` Aneesh Kumar K.V
  1 sibling, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2018-11-19  3:15 UTC (permalink / raw)
  To: Keith Busch, Matthew Wilcox
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams



On 11/17/2018 12:02 AM, Keith Busch wrote:
> On Thu, Nov 15, 2018 at 12:36:54PM -0800, Matthew Wilcox wrote:
>> On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
>>> On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
>>>> On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
>>>>> Memory-only nodes will often have affinity to a compute node, and
>>>>> platforms have ways to express that locality relationship.
>>>>>
>>>>> A node containing CPUs or other DMA devices that can initiate memory
>>>>> access are referred to as "memory initiators". A "memory target" is a
>>>>> node that provides at least one physical address range accessible to a
>>>>> memory initiator.
>>>>
>>>> I think I may be confused here.  If there is _no_ link from node X to
>>>> node Y, does that mean that node X's CPUs cannot access the memory on
>>>> node Y?  In my mind, all nodes can access all memory in the system,
>>>> just not with uniform bandwidth/latency.
>>>
>>> The link is just about which nodes are "local". It's like how nodes have
>>> a cpulist. Other CPUs not in the node's list can access that node's memory,
>>> but the ones in the mask are local, and provide useful optimization hints.
>>
>> So ... let's imagine a hypothetical system (I've never seen one built like
>> this, but it doesn't seem too implausible).  Connect four CPU sockets in
>> a square, each of which has some regular DIMMs attached to it.  CPU A is
>> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
>> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
>> special memory extender device attached on the PCIe bus.  Now there's
>> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
>> not as local as Memory B is ... and we'd probably _prefer_ to allocate
>> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
>> this seems hard.
> 
> Indeed, that particular example is out of scope for this series. The
> first objective is to aid a process running in node B's CPUs to allocate
> memory in B1. Anything that crosses QPI is on its own.

This is problematic. Any new kernel API should accommodate B2-type memory
from the above example as well, which is on a PCIe bus, because eventually
it would be represented as some sort of NUMA node and applications will
then have to depend on this sysfs interface for their desired memory
placement requirements. Unless this interface is thought through for
B2-type memory, it might not be extensible in the future.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/7] node: Add heterogenous memory performance
  2018-11-14 22:49 ` [PATCH 2/7] node: Add heterogenous memory performance Keith Busch
@ 2018-11-19  3:35   ` Anshuman Khandual
  2018-11-19 15:46     ` Keith Busch
  2018-11-27  7:00   ` Dan Williams
  1 sibling, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2018-11-19  3:35 UTC (permalink / raw)
  To: Keith Busch, linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams



On 11/15/2018 04:19 AM, Keith Busch wrote:
> Heterogeneous memory systems provide memory nodes with latency
> and bandwidth performance attributes that are different from other
> nodes. Create an interface for the kernel to register these attributes

There are other properties like power consumption, reliability which can
be associated with a particular PA range. Also the set of properties has
to be extensible for the future.

> under the node that provides the memory. If the system provides this
> information, applications can query the node attributes when deciding
> which node to request memory.

Right, but each (memory initiator, memory target) pair should have these
above-mentioned properties enumerated, giving a 'property as seen from the
initiator' kind of semantics.

> 
> When multiple memory initiators exist, accessing the same memory target
> from each may not perform the same as the other. The highest performing
> initiator to a given target is considered to be a local initiator for
> that target. The kernel provides performance attributes only for the
> local initiators.

As mentioned above the interface must enumerate a future extensible set
of properties for each (memory initiator, memory target) pair available
on the system.

> 
> The memory's compute node should be symlinked in sysfs as one of the
> node's initiators.

Right. IIUC the first patch skips the linking process for two nodes A and
B if (A == B), preventing association with the local memory initiator.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/7] node: Add memory caching attributes
  2018-11-14 22:49 ` [PATCH 4/7] node: Add memory caching attributes Keith Busch
  2018-11-15  0:40   ` Dave Hansen
@ 2018-11-19  4:14   ` Anshuman Khandual
  2018-11-19 23:06     ` Keith Busch
  2018-11-26 19:06   ` Greg Kroah-Hartman
  2018-11-26 19:06   ` Greg Kroah-Hartman
  3 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2018-11-19  4:14 UTC (permalink / raw)
  To: Keith Busch, linux-kernel, linux-acpi, linux-mm
  Cc: Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams



On 11/15/2018 04:19 AM, Keith Busch wrote:
> System memory may have side caches to help improve access speed. While
> the system provided cache is transparent to the software accessing
> these memory ranges, applications can optimize their own access based
> on cache attributes.

Cache is not a separate memory attribute; it impacts the real attributes
like bandwidth and latency, which are already captured in the previous patch.
What is the purpose of adding this as a separate attribute? Can you explain
how this is going to help user space apart from the hints it has already
received via the bandwidth, latency etc. properties?

> 
> In preparation for such systems, provide a new API for the kernel to
> register these memory side caches under the memory node that provides it.

Under the target memory node interface /sys/devices/system/node/nodeY/target*?

> 
> The kernel's sysfs representation is modeled from the cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> cacheinfo, though, a higher node's memory cache level is nearer to the
> CPU, while lower levels are closer to the backing memory. Also unlike
> CPU cache, the system handles flushing any dirty cached memory to the
> last level of memory on a power failure if the range is persistent.

Let's assume that a CPU has four levels of caches L1, L2, L3, L4 before
reaching memory. L4 is the backing cache for the memory and L1-L3 are from
the CPU to the system bus. Hence some of them will be represented as CPU
caches and some of them will be represented as memory caches?

/sys/devices/system/cpu/cpuX/cache/ --> L1, L2, L3
/sys/devices/system/node/nodeY/target --> L4 

L4 will be listed even if the node is memory only?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/7] acpi: Create subtable parsing infrastructure
  2018-11-14 22:49 ` [PATCH 6/7] acpi: Create subtable parsing infrastructure Keith Busch
@ 2018-11-19  9:58   ` Rafael J. Wysocki
  2018-11-19 18:36     ` Keith Busch
  0 siblings, 1 reply; 43+ messages in thread
From: Rafael J. Wysocki @ 2018-11-19  9:58 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman,
	Rafael J. Wysocki, Dave Hansen, Dan Williams

On Wed, Nov 14, 2018 at 11:53 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Parsing entries in an ACPI table had assumed a generic header structure
> that is most common. There is no standard ACPI header, though, so less
> common types would need custom parsers if they want go walk their
> subtable entry list.
>
> Create the infrastructure for adding different table types so parsing
> the entries array may be more reused for all ACPI system tables.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/acpi/tables.c | 75 ++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 65 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> index 61203eebf3a1..15ee77780f68 100644
> --- a/drivers/acpi/tables.c
> +++ b/drivers/acpi/tables.c
> @@ -49,6 +49,19 @@ static struct acpi_table_desc initial_tables[ACPI_MAX_TABLES] __initdata;
>
>  static int acpi_apic_instance __initdata;
>
> +enum acpi_subtable_type {
> +       ACPI_SUBTABLE_COMMON,
> +};
> +
> +union acpi_subtable_headers {
> +       struct acpi_subtable_header common;
> +};
> +
> +struct acpi_subtable_entry {
> +       union acpi_subtable_headers *hdr;
> +       enum acpi_subtable_type type;
> +};
> +
>  /*
>   * Disable table checksum verification for the early stage due to the size
>   * limitation of the current x86 early mapping implementation.
> @@ -217,6 +230,45 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
>         }
>  }
>
> +static unsigned long __init
> +acpi_get_entry_type(struct acpi_subtable_entry *entry)
> +{
> +       switch (entry->type) {
> +       case ACPI_SUBTABLE_COMMON:
> +               return entry->hdr->common.type;
> +       }
> +       WARN_ONCE(1, "invalid acpi type\n");
> +       return 0;
> +}
> +
> +static unsigned long __init
> +acpi_get_entry_length(struct acpi_subtable_entry *entry)
> +{
> +       switch (entry->type) {
> +       case ACPI_SUBTABLE_COMMON:
> +               return entry->hdr->common.length;
> +       }
> +       WARN_ONCE(1, "invalid acpi type\n");

AFAICS this does a WARN_ONCE() on information obtained from firmware.

That is not a kernel problem, so generating traces in that case is not
a good idea IMO.  Moreover, users can't really do much about this in
the majority of cases, so a pr_info() message should be sufficient.

And similarly below.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/7] node: Add heterogenous memory performance
  2018-11-19  3:35   ` Anshuman Khandual
@ 2018-11-19 15:46     ` Keith Busch
  2018-11-22 13:22       ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: Keith Busch @ 2018-11-19 15:46 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Mon, Nov 19, 2018 at 09:05:07AM +0530, Anshuman Khandual wrote:
> On 11/15/2018 04:19 AM, Keith Busch wrote:
> > Heterogeneous memory systems provide memory nodes with latency
> > and bandwidth performance attributes that are different from other
> > nodes. Create an interface for the kernel to register these attributes
> 
> There are other properties like power consumption, reliability which can
> be associated with a particular PA range. Also the set of properties has
> to be extensible for the future.

Sure, I'm just starting with the attributes available from HMAT.
If there are additional possible attributes that make sense to add, I
don't see why we can't continue appending them if this patch is okay.
 
> > under the node that provides the memory. If the system provides this
> > information, applications can query the node attributes when deciding
> > which node to request memory.
> 
> Right, but each (memory initiator, memory target) pair should have these
> above-mentioned properties enumerated, giving a 'property as seen from the
> initiator' kind of semantics.
> 
> > 
> > When multiple memory initiators exist, accessing the same memory target
> > from each may not perform the same as the other. The highest performing
> > initiator to a given target is considered to be a local initiator for
> > that target. The kernel provides performance attributes only for the
> > local initiators.
> 
> As mentioned above the interface must enumerate a future extensible set
> of properties for each (memory initiator, memory target) pair available
> on the system.

That seems less friendly to use if it forces the application to figure out
which CPU is best for a given memory node rather than just providing
that answer directly.

> > The memory's compute node should be symlinked in sysfs as one of the
> > node's initiators.
> 
> Right. IIUC the first patch skips the linking process for two nodes A and
> B if (A == B), preventing association with the local memory initiator.

Right, CPUs and memory sharing a proximity domain are assumed to be
local to each other, so we're not going to set up those links to the
node itself.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-19  3:15         ` Anshuman Khandual
@ 2018-11-19 15:49           ` Keith Busch
  0 siblings, 0 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-19 15:49 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Matthew Wilcox, linux-kernel, linux-acpi, linux-mm,
	Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams

On Mon, Nov 19, 2018 at 08:45:25AM +0530, Anshuman Khandual wrote:
> On 11/17/2018 12:02 AM, Keith Busch wrote:
> > On Thu, Nov 15, 2018 at 12:36:54PM -0800, Matthew Wilcox wrote:
> >> So ... let's imagine a hypothetical system (I've never seen one built like
> >> this, but it doesn't seem too implausible).  Connect four CPU sockets in
> >> a square, each of which has some regular DIMMs attached to it.  CPU A is
> >> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
> >> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
> >> special memory extender device attached on the PCIe bus.  Now there's
> >> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
> >> not as local as Memory B is ... and we'd probably _prefer_ to allocate
> >> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
> >> this seems hard.
> > 
> > Indeed, that particular example is out of scope for this series. The
> > first objective is to aid a process running in node B's CPUs to allocate
> > memory in B1. Anything that crosses QPI is on its own.
> 
> This is problematic. Any new kernel API should accommodate B2-type memory
> from the above example as well, which is on a PCIe bus, because eventually
> it would be represented as some sort of NUMA node and applications will
> then have to depend on this sysfs interface for their desired memory
> placement requirements. Unless this interface is thought through for
> B2-type memory, it might not be extensible in the future.

I'm not sure I understand the concern. The proposal allows linking B
to B2 memory.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/7] acpi: Create subtable parsing infrastructure
  2018-11-19  9:58   ` Rafael J. Wysocki
@ 2018-11-19 18:36     ` Keith Busch
  0 siblings, 0 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-19 18:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, ACPI Devel Maling List,
	Linux Memory Management List, Greg Kroah-Hartman, Dave Hansen,
	Dan Williams

On Mon, Nov 19, 2018 at 10:58:12AM +0100, Rafael J. Wysocki wrote:
> > +static unsigned long __init
> > +acpi_get_entry_length(struct acpi_subtable_entry *entry)
> > +{
> > +       switch (entry->type) {
> > +       case ACPI_SUBTABLE_COMMON:
> > +               return entry->hdr->common.length;
> > +       }
> > +       WARN_ONCE(1, "invalid acpi type\n");
> 
> AFAICS this does a WARN_ONCE() on information obtained from firmware.
> 
> That is not a kernel problem, so generating traces in that case is not
> a good idea IMO.  Moreover, users can't really do much about this in
> the majority of cases, so a pr_info() message should be sufficient.

Sure thing, I'll fix that up for the next revision.
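
For reference, the fixed-up helper might look something like this (a sketch
of the agreed direction, not the posted v2):

	static unsigned long __init
	acpi_get_entry_length(struct acpi_subtable_entry *entry)
	{
		switch (entry->type) {
		case ACPI_SUBTABLE_COMMON:
			return entry->hdr->common.length;
		}
		pr_info("invalid acpi subtable type\n");  /* was WARN_ONCE() */
		return 0;
	}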

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/7] node: Add memory caching attributes
  2018-11-19  4:14   ` Anshuman Khandual
@ 2018-11-19 23:06     ` Keith Busch
  2018-11-22 13:29       ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: Keith Busch @ 2018-11-19 23:06 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Mon, Nov 19, 2018 at 09:44:00AM +0530, Anshuman Khandual wrote:
> On 11/15/2018 04:19 AM, Keith Busch wrote:
> > System memory may have side caches to help improve access speed. While
> > the system provided cache is transparent to the software accessing
> > these memory ranges, applications can optimize their own access based
> > on cache attributes.
> 
> Cache is not a separate memory attribute; it impacts the real attributes
> like bandwidth and latency, which are already captured in the previous patch.
> What is the purpose of adding this as a separate attribute? Can you explain
> how this is going to help user space apart from the hints it has already
> received via the bandwidth, latency etc. properties?

I am not sure I understand the question here. Access bandwidth and latency
are entirely different attributes than what this patch provides. If the
system side-caches memory, the associativity, line size, and total size
can optionally be used by software to improve performance.
 
> > In preparation for such systems, provide a new API for the kernel to
> > register these memory side caches under the memory node that provides it.
> 
> Under the target memory node interface /sys/devices/system/node/nodeY/target*?

Yes.
 
> > 
> > The kernel's sysfs representation is modeled from the cpu cacheinfo
> > attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> > cacheinfo, though, a higher node's memory cache level is nearer to the
> > CPU, while lower levels are closer to the backing memory. Also unlike
> > CPU cache, the system handles flushing any dirty cached memory to the
> > last level of memory on a power failure if the range is persistent.
> 
> Let's assume that a CPU has four levels of caches L1, L2, L3, L4 before
> reaching memory. L4 is the backing cache for the memory 

I don't quite understand this question either. The cache doesn't back
the memory; the system caches accesses to memory on the memory side.

> and L1-L3 are from
> the CPU to the system bus. Hence some of them will be represented as CPU
> caches and some of them will be represented as memory caches?
>
> /sys/devices/system/cpu/cpuX/cache/ --> L1, L2, L3
> /sys/devices/system/node/nodeY/target --> L4 
> 
> L4 will be listed even if the node is memory only?

The system-provided memory-side caches are independent of the
CPU. I'm just providing the CPU caches as a more familiar example to
compare/contrast system memory cache attributes.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/7] doc/vm: New documentation for memory performance
  2018-11-14 22:49 ` [PATCH 3/7] doc/vm: New documentation for " Keith Busch
  2018-11-15 12:59   ` Jonathan Cameron
@ 2018-11-20 13:51   ` Mike Rapoport
  2018-11-20 15:31     ` Keith Busch
  1 sibling, 1 reply; 43+ messages in thread
From: Mike Rapoport @ 2018-11-20 13:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

Hi,

Sorry if I'm jumping in too late.

On Wed, Nov 14, 2018 at 03:49:16PM -0700, Keith Busch wrote:
> Platforms may provide system memory where some physical address ranges
> perform differently than others. These heterogeneous memory attributes are
> common to the node that provides the memory and exported by the kernel.
> 
> Add new documentation providing a brief overview of such systems and
> the attributes the kernel makes available to aid applications wishing
> to query this information.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/vm/numaperf.rst | 71 +++++++++++++++++++++++++++++++++++++++++++

As this document describes user-space interfaces it belongs to
Documentation/admin-guide/mm.

>  1 file changed, 71 insertions(+)
>  create mode 100644 Documentation/vm/numaperf.rst
> 
> diff --git a/Documentation/vm/numaperf.rst b/Documentation/vm/numaperf.rst
> new file mode 100644
> index 000000000000..5a3ecaff5474
> --- /dev/null
> +++ b/Documentation/vm/numaperf.rst
> @@ -0,0 +1,71 @@
> +.. _numaperf:
> +
> +================
> +NUMA Performance
> +================
> +
> +Some platforms may have multiple types of memory attached to a single
> +CPU. These disparate memory ranges share some characteristics, such as
> +CPU cache coherence, but may have different performance. For example,
> +different media types and buses affect bandwidth and latency.
> +
> +A system supporting such heterogeneous memory groups each memory type
> +under different "nodes" based on similar CPU locality and performance
> +characteristics.  Some memory may share the same node as a CPU, and
> +others are provided as memory-only nodes. While memory only nodes do not
> +provide CPUs, they may still be local to one or more compute nodes. The
> +following diagram shows one such example of two compute nodes with local
> +memory and a memory only node for each compute node:
> +
> + +------------------+     +------------------+
> + | Compute Node 0   +-----+ Compute Node 1   |
> + | Local Node0 Mem  |     | Local Node1 Mem  |
> + +--------+---------+     +--------+---------+
> +          |                        |
> + +--------+---------+     +--------+---------+
> + | Slower Node2 Mem |     | Slower Node3 Mem |
> + +------------------+     +--------+---------+
> +
> +A "memory initiator" is a node containing one or more devices such as
> +CPUs or separate memory I/O devices that can initiate memory requests. A
> +"memory target" is a node containing one or more CPU-accessible physical
> +address ranges.
> +
> +When multiple memory initiators exist, accessing the same memory
> +target may not perform the same as each other. The highest performing
> +initiator to a given target is considered to be one of that target's
> +local initiators.
> +
> +To aid applications matching memory targets with their initiators,
> +the kernel provides symlinks to each other like the following example::
> +
> +	# ls -l /sys/devices/system/node/nodeX/target*
> +	/sys/devices/system/node/nodeX/targetY -> ../nodeY
> +
> +	# ls -l /sys/devices/system/node/nodeY/initiator*
> +	/sys/devices/system/node/nodeY/initiatorX -> ../nodeX
> +
> +Applications may wish to consider which node they want their memory to
> +be allocated from based on the node's performance characteristics. If
> +the system provides these attributes, the kernel exports them under the
> +node sysfs hierarchy by appending the initiator_access directory under
> +the node as follows::
> +
> +	/sys/devices/system/node/nodeY/initiator_access/
> +
> +The kernel does not provide performance attributes for non-local memory
> +initiators. The performance characteristics the kernel provides for
> +the local initiators are exported as follows::
> +
> +	# tree /sys/devices/system/node/nodeY/initiator_access
> +	/sys/devices/system/node/nodeY/initiator_access
> +	|-- read_bandwidth
> +	|-- read_latency
> +	|-- write_bandwidth
> +	`-- write_latency
> +
> +The bandwidth attributes are provided in MiB/second.
> +
> +The latency attributes are provided in nanoseconds.
> +
> +See also: https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> -- 
> 2.14.4
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/7] doc/vm: New documentation for memory cache
  2018-11-14 22:49 ` [PATCH 5/7] doc/vm: New documentation for memory cache Keith Busch
  2018-11-15  0:41   ` Dave Hansen
  2018-11-15 13:16   ` Jonathan Cameron
@ 2018-11-20 13:53   ` Mike Rapoport
  2 siblings, 0 replies; 43+ messages in thread
From: Mike Rapoport @ 2018-11-20 13:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Wed, Nov 14, 2018 at 03:49:18PM -0700, Keith Busch wrote:
> Platforms may provide system memory that contains side caches to help
> spped up access. These memory caches are part of a memory node and
> the cache attributes are exported by the kernel.
> 
> Add new documentation providing a brief overview of system memory side
> caches and the kernel provided attributes for application optimization.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/vm/numacache.rst | 76 ++++++++++++++++++++++++++++++++++++++++++

I think it's better to have both numaperf and numacache in a single
document, as they are quite related.

>  1 file changed, 76 insertions(+)
>  create mode 100644 Documentation/vm/numacache.rst
> 
> diff --git a/Documentation/vm/numacache.rst b/Documentation/vm/numacache.rst
> new file mode 100644
> index 000000000000..e79c801b7e3b
> --- /dev/null
> +++ b/Documentation/vm/numacache.rst
> @@ -0,0 +1,76 @@
> +.. _numacache:
> +
> +==========
> +NUMA Cache
> +==========
> +
> +System memory may be constructed in a hierarchy of various performing
> +characteristics in order to provide large address space of slower
> +performing memory cached by a smaller size of higher performing
> +memory. The system physical addresses that software is aware of see
> +is provided by the last memory level in the hierarchy, while higher
> +performing memory transparently provides caching to slower levels.
> +
> +The term "far memory" is used to denote the last level memory in the
> +hierarchy. Each increasing cache level provides higher performing CPU
> +access, and the term "near memory" represents the highest level cache
> +provided by the system. This number is different than CPU caches where
> +the cache level (ex: L1, L2, L3) uses a CPU centric view with each level
> +being lower performing and closer to system memory. The memory cache
> +level is centric to the last level memory, so the higher numbered cache
> +level denotes memory nearer to the CPU, and further from far memory.
> +
> +The memory side caches are not directly addressable by software. When
> +software accesses a system address, the system will return it from the
> +near memory cache if it is present. If it is not present, the system
> +accesses the next level of memory until there is either a hit in that
> +cache level, or it reaches far memory.
> +
> +In order to maximize the performance out of such a setup, software may
> +wish to query the memory cache attributes. If the system provides a way
> +to query this information, for example with ACPI HMAT (Heterogeneous
> +Memory Attribute Table)[1], the kernel will append these attributes to
> +the NUMA node that provides the memory.
> +
> +When the kernel first registers a memory cache with a node, the kernel
> +will create the following directory::
> +
> +	/sys/devices/system/node/nodeX/cache/
> +
> +If that directory is not present, then either the memory does not have
> +a side cache, or that information is not provided to the kernel.
> +
> +The attributes for each level of cache are provided under its cache
> +level index::
> +
> +	/sys/devices/system/node/nodeX/cache/indexA/
> +	/sys/devices/system/node/nodeX/cache/indexB/
> +	/sys/devices/system/node/nodeX/cache/indexC/
> +
> +Each cache level's directory provides its attributes. For example,
> +the following is a single cache level and the attributes available for
> +software to query::
> +
> +	# tree sys/devices/system/node/node0/cache/
> +	/sys/devices/system/node/node0/cache/
> +	|-- index1
> +	|   |-- associativity
> +	|   |-- level
> +	|   |-- line_size
> +	|   |-- size
> +	|   `-- write_policy
> +
> +The cache "associativity" will be 0 if it is a direct-mapped cache, and
> +non-zero for any other indexed based, multi-way associativity.
> +
> +The "level" is the distance from the far memory, and matches the number
> +appended to its "index" directory.
> +
> +The "line_size" is the number of bytes accessed on a cache miss.
> +
> +The "size" is the number of bytes provided by this cache level.
> +
> +The "write_policy" will be 0 for write-back, and non-zero for
> +write-through caching.
> +
> +[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> -- 
> 2.14.4
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/7] doc/vm: New documentation for memory performance
  2018-11-20 13:51   ` Mike Rapoport
@ 2018-11-20 15:31     ` Keith Busch
  0 siblings, 0 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-20 15:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Tue, Nov 20, 2018 at 02:51:50PM +0100, Mike Rapoport wrote:
> On Wed, Nov 14, 2018 at 03:49:16PM -0700, Keith Busch wrote:
> > Platforms may provide system memory where some physical address ranges
> > perform differently than others. These heterogeneous memory attributes are
> > common to the node that provides the memory and exported by the kernel.
> > 
> > Add new documentation providing a brief overview of such systems and
> > the attributes the kernel makes available to aid applications wishing
> > to query this information.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  Documentation/vm/numaperf.rst | 71 +++++++++++++++++++++++++++++++++++++++++++
> 
> As this document describes user-space interfaces it belongs to
> Documentation/admin-guide/mm.

Thanks for the feedback. I'll move this and combine with the memory
cache doc in the v2.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/7] node: Add heterogenous memory performance
  2018-11-19 15:46     ` Keith Busch
@ 2018-11-22 13:22       ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2018-11-22 13:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams



On 11/19/2018 09:16 PM, Keith Busch wrote:
> On Mon, Nov 19, 2018 at 09:05:07AM +0530, Anshuman Khandual wrote:
>> On 11/15/2018 04:19 AM, Keith Busch wrote:
>>> Heterogeneous memory systems provide memory nodes with latency
>>> and bandwidth performance attributes that are different from other
>>> nodes. Create an interface for the kernel to register these attributes
>>
>> There are other properties like power consumption, reliability which can
>> be associated with a particular PA range. Also the set of properties has
>> to be extensible for the future.
> 
> Sure, I'm just starting with the attributes available from HMAT.
> If there are additional possible attributes that make sense to add, I
> don't see why we can't continue appending them if this patch is okay.

As I mentioned on the other thread:

1) The interface needs to be compact to avoid a large number of files
2) A single u64 will be able to handle 8 attributes with 8-bit values
3) The 8-bit values need to be arch independent and abstracted out

I guess 8 attributes should be good enough for all types of memory in the
foreseeable future.
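
A sketch of that packing, with the slot layout and helper names invented
for illustration:

	#define MEM_ATTR_BITS	8
	#define MEM_ATTR_MASK	((1ULL << MEM_ATTR_BITS) - 1)

	/* Eight 8-bit attribute slots packed into a single u64. */
	static inline u64 mem_attr_set(u64 attrs, unsigned int slot, u8 val)
	{
		attrs &= ~(MEM_ATTR_MASK << (slot * MEM_ATTR_BITS));
		return attrs | ((u64)val << (slot * MEM_ATTR_BITS));
	}

	static inline u8 mem_attr_get(u64 attrs, unsigned int slot)
	{
		return (attrs >> (slot * MEM_ATTR_BITS)) & MEM_ATTR_MASK;
	}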

>  
>>> under the node that provides the memory. If the system provides this
>>> information, applications can query the node attributes when deciding
>>> from which node to request memory.
>>
>> Right but each (memory initiator, memory target) should have these above
>> mentioned properties enumerated to have an 'property as seen' from kind
>> of semantics.
>>
>>>
>>> When multiple memory initiators exist, accessing the same memory target
>>> from each may not perform the same as the other. The highest performing
>>> initiator to a given target is considered to be a local initiator for
>>> that target. The kernel provides performance attributes only for the
>>> local initiators.
>>
>> As mentioned above the interface must enumerate a future extensible set
>> of properties for each (memory initiator, memory target) pair available
>> on the system.
> 
> That seems less friendly to use if it forces the application to figure out
> which CPU is the best for a given memory node rather than just providing
> that answer directly.

Why? The application would just have to scan all possible values out
there and decide for itself. A complete set of attribute values for
each pair makes the sysfs interface more comprehensive and gives the
application more control over its choices.
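
For example, with a hypothetical per-pair layout (not what this series
implements), the scan would be as simple as:

  # one accessX directory per (initiator X, target) pair -- made up
  for f in /sys/devices/system/node/node*/access*/read_latency; do
          echo "$f: $(cat $f)"
  done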

> 
>>> The memory's compute node should be symlinked in sysfs as one of the
>>> node's initiators.
>>
>> Right. IIUC the first patch skips the linking process for two nodes A
>> and B if (A == B), preventing association with the local memory initiator.
> 
> Right, CPUs and memory sharing a proximity domain are assumed to be
> local to each other, so we don't set up a node's link to itself.

But this will be required for applications to evaluate correctly among
the possible values from all node pairs.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/7] node: Add memory caching attributes
  2018-11-19 23:06     ` Keith Busch
@ 2018-11-22 13:29       ` Anshuman Khandual
  2018-11-26 15:14         ` Keith Busch
  0 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2018-11-22 13:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams



On 11/20/2018 04:36 AM, Keith Busch wrote:
> On Mon, Nov 19, 2018 at 09:44:00AM +0530, Anshuman Khandual wrote:
>> On 11/15/2018 04:19 AM, Keith Busch wrote:
>>> System memory may have side caches to help improve access speed. While
>>> the system provided cache is transparent to the software accessing
>>> these memory ranges, applications can optimize their own access based
>>> on cache attributes.
>>
>> Cache is not a separate memory attribute. It impacts the real attributes,
>> like bandwidth and latency, which are already captured in the previous patch.
>> What is the purpose of adding this as a separate attribute? Can you explain
>> how this is going to help user space apart from the hints it has already
>> received with the bandwidth, latency, etc. properties?
> 
> I am not sure I understand the question here. Access bandwidth and latency
> are entirely different attributes from what this patch provides. If the
> system side-caches memory, the associativity, line size, and total size
> can optionally be used by software to improve performance.

Okay, but then does this belong to this series, which is about memory attributes?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/7] node: Add memory caching attributes
  2018-11-22 13:29       ` Anshuman Khandual
@ 2018-11-26 15:14         ` Keith Busch
  0 siblings, 0 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-26 15:14 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

On Thu, Nov 22, 2018 at 06:59:21PM +0530, Anshuman Khandual wrote:
> 
> 
> On 11/20/2018 04:36 AM, Keith Busch wrote:
> > On Mon, Nov 19, 2018 at 09:44:00AM +0530, Anshuman Khandual wrote:
> >> On 11/15/2018 04:19 AM, Keith Busch wrote:
> >>> System memory may have side caches to help improve access speed. While
> >>> the system provided cache is transparent to the software accessing
> >>> these memory ranges, applications can optimize their own access based
> >>> on cache attributes.
> >>
> >> Cache is not a separate memory attribute. It impacts the real attributes,
> >> like bandwidth and latency, which are already captured in the previous patch.
> >> What is the purpose of adding this as a separate attribute? Can you explain
> >> how this is going to help user space apart from the hints it has already
> >> received with the bandwidth, latency, etc. properties?
> > 
> > I am not sure I understand the question here. Access bandwidth and latency
> > are entirely different attributes from what this patch provides. If the
> > system side-caches memory, the associativity, line size, and total size
> > can optionally be used by software to improve performance.
> 
> Okay, but then does this belong to this series, which is about memory attributes?

This patch series is about exporting memory attributes, and this system
memory caching is one such attribute, so yes, I think it belongs.
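
As a purely illustrative example (the sysfs path and the node/index
numbers are assumptions based on this series), an application could
read the line size and align its hot buffers to it:

  #define _POSIX_C_SOURCE 200112L
  #include <stdio.h>
  #include <stdlib.h>

  /* Read one integer attribute from sysfs; returns -1 on failure. */
  static long read_attr(const char *path)
  {
          FILE *f = fopen(path, "r");
          long val = -1;

          if (f) {
                  if (fscanf(f, "%ld", &val) != 1)
                          val = -1;
                  fclose(f);
          }
          return val;
  }

  int main(void)
  {
          long line = read_attr("/sys/devices/system/node/node1/side_cache/index1/line_size");
          void *buf;

          if (line <= 0)
                  line = 64;      /* fall back to a common line size */
          if (posix_memalign(&buf, line, 1 << 20))
                  return 1;
          /* ... choose strides/working set from size and associativity ... */
          free(buf);
          return 0;
  }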

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/7] node: Add memory caching attributes
  2018-11-14 22:49 ` [PATCH 4/7] node: Add memory caching attributes Keith Busch
  2018-11-15  0:40   ` Dave Hansen
  2018-11-19  4:14   ` Anshuman Khandual
@ 2018-11-26 19:06   ` Greg Kroah-Hartman
  2018-11-26 19:53     ` Keith Busch
  2018-11-26 19:06   ` Greg Kroah-Hartman
  3 siblings, 1 reply; 43+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-26 19:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Rafael Wysocki, Dave Hansen,
	Dan Williams

On Wed, Nov 14, 2018 at 03:49:17PM -0700, Keith Busch wrote:
> System memory may have side caches to help improve access speed. While
> the system provided cache is transparent to the software accessing
> these memory ranges, applications can optimize their own access based
> on cache attributes.
> 
> In preparation for such systems, provide a new API for the kernel to
> register these memory side caches under the memory node that provides it.
> 
> The kernel's sysfs representation is modeled from the cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> cacheinfo, though, a higher node's memory cache level is nearer to the
> CPU, while lower levels are closer to the backing memory. Also unlike
> CPU cache, the system handles flushing any dirty cached memory to the
> last level memory on a power failure if the range is persistent.
> 
> The exported attributes are the cache size, the line size, associativity,
> and write back policy.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/base/node.c  | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/node.h |  23 ++++++++++
>  2 files changed, 140 insertions(+)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 232535761998..bb94f1d18115 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -60,6 +60,12 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>  
>  #ifdef CONFIG_HMEM
> +struct node_cache_obj {
> +	struct kobject kobj;
> +	struct list_head node;
> +	struct node_cache_attrs cache_attrs;
> +};

I know you all are off in the weeds designing some new crazy api for
this instead of this current proposal (sorry, I lost the thread, I'll
wait for the patches before commenting on it), but I do want to say one
thing here.

NEVER use a raw kobject as a child for a 'struct device' unless you
REALLY REALLY REALLY REALLY know what you are doing and have a VERY good
reason to do so.

Just use a 'struct device', otherwise you end up having to reinvent all
of the core logic that struct device provides you, like attribute
callbacks (which you had to create), and other good stuff like telling
userspace that a device has shown up so it knows to look at it.

That last one is key: a kobject is suddenly a "black hole" in sysfs as
far as userspace knows because it does not see them for the most part
(unless you are mucking around in the filesystem on your own, and
really, don't do that, use a library like the rest of the world unless
you really like reinventing everything, which, from your patchset it
feels like...)

Anyway, use 'struct device'.  That's all.

greg k-h

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/7] node: Add memory caching attributes
  2018-11-14 22:49 ` [PATCH 4/7] node: Add memory caching attributes Keith Busch
                     ` (2 preceding siblings ...)
  2018-11-26 19:06   ` Greg Kroah-Hartman
@ 2018-11-26 19:06   ` Greg Kroah-Hartman
  3 siblings, 0 replies; 43+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-26 19:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-acpi, linux-mm, Rafael Wysocki, Dave Hansen,
	Dan Williams

On Wed, Nov 14, 2018 at 03:49:17PM -0700, Keith Busch wrote:
> System memory may have side caches to help improve access speed. While
> the system provided cache is transparent to the software accessing
> these memory ranges, applications can optimize their own access based
> on cache attributes.
> 
> In preparation for such systems, provide a new API for the kernel to
> register these memory side caches under the memory node that provides it.
> 
> The kernel's sysfs representation is modeled from the cpu cacheinfo
> attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> cacheinfo, though, a higher node's memory cache level is nearer to the
> CPU, while lower levels are closer to the backing memory. Also unlike
> CPU cache, the system handles flushing any dirty cached memory to the
> last level memory on a power failure if the range is persistent.
> 
> The exported attributes are the cache size, the line size, associativity,
> and write back policy.

You also didn't document your new sysfs attributes/layout in a
> Documentation/ABI/ entry, which is required for any sysfs change...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/7] node: Add memory caching attributes
  2018-11-26 19:06   ` Greg Kroah-Hartman
@ 2018-11-26 19:53     ` Keith Busch
  0 siblings, 0 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-26 19:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-kernel, linux-acpi, linux-mm, Rafael Wysocki, Hansen, Dave,
	Williams, Dan J

On Mon, Nov 26, 2018 at 11:06:19AM -0800, Greg Kroah-Hartman wrote:
> On Wed, Nov 14, 2018 at 03:49:17PM -0700, Keith Busch wrote:
> > System memory may have side caches to help improve access speed. While
> > the system provided cache is transparent to the software accessing
> > these memory ranges, applications can optimize their own access based
> > on cache attributes.
> > 
> > In preparation for such systems, provide a new API for the kernel to
> > register these memory side caches under the memory node that provides it.
> > 
> > The kernel's sysfs representation is modeled from the cpu cacheinfo
> > attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU
> > cacheinfo, though, a higher node's memory cache level is nearer to the
> > CPU, while lower levels are closer to the backing memory. Also unlike
> > CPU cache, the system handles flushing any dirty cached memory to the
> > last level memory on a power failure if the range is persistent.
> > 
> > The exported attributes are the cache size, the line size, associativity,
> > and write back policy.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  drivers/base/node.c  | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/node.h |  23 ++++++++++
> >  2 files changed, 140 insertions(+)
> > 
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index 232535761998..bb94f1d18115 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -60,6 +60,12 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
> >  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
> >  
> >  #ifdef CONFIG_HMEM
> > +struct node_cache_obj {
> > +	struct kobject kobj;
> > +	struct list_head node;
> > +	struct node_cache_attrs cache_attrs;
> > +};
> 
> I know you all are off in the weeds designing some new crazy api for
> this instead of this current proposal (sorry, I lost the thread, I'll
> wait for the patches before commenting on it), but I do want to say one
> thing here.
> 
> NEVER use a raw kobject as a child for a 'struct device' unless you
> REALLY REALLY REALLY REALLY know what you are doing and have a VERY good
> reason to do so.
> 
> Just use a 'struct device', otherwise you end up having to reinvent all
> of the core logic that struct device provides you, like attribute
> callbacks (which you had to create), and other good stuff like telling
> userspace that a device has shown up so it knows to look at it.
> 
> That last one is key: a kobject is suddenly a "black hole" in sysfs as
> far as userspace knows because it does not see them for the most part
> (unless you are mucking around in the filesystem on your own, and
> really, don't do that, use a library like the rest of the world unless
> you really like reinventing everything, which, from your patchset it
> feels like...)
> 
> Anyway, use 'struct device'.  That's all.
> 
> greg k-h

Okay, thank you for the advice. I prefer to reuse over reinvent. :)

I only used kobject because the power/ directory was automatically
created with 'struct device', but now I see there are better ways to
suppress that.
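
For the v2, I'm thinking of something like this minimal sketch (the
struct members are from this patch; the conversion itself is
illustrative and untested):

  /*
   * Embed a struct device so the driver core handles sysfs attributes,
   * uevents and lifetime, instead of reinventing them on a raw kobject.
   */
  struct node_cache_obj {
          struct device dev;
          struct list_head node;
          struct node_cache_attrs cache_attrs;
  };

  static void node_cache_release(struct device *dev)
  {
          kfree(container_of(dev, struct node_cache_obj, dev));
  }

  /*
   * Registration then becomes roughly:
   *
   *      device_initialize(&cache_obj->dev);
   *      cache_obj->dev.parent = &node_devices[nid]->dev;
   *      cache_obj->dev.release = node_cache_release;
   *      dev_set_name(&cache_obj->dev, "index%d", cache_obj->cache_attrs.level);
   *      device_add(&cache_obj->dev);
   */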

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/7] node: Add heterogenous memory performance
  2018-11-14 22:49 ` [PATCH 2/7] node: Add heterogenous memory performance Keith Busch
  2018-11-19  3:35   ` Anshuman Khandual
@ 2018-11-27  7:00   ` Dan Williams
  2018-11-27 17:42     ` Dan Williams
  2018-11-27 17:44     ` Keith Busch
  1 sibling, 2 replies; 43+ messages in thread
From: Dan Williams @ 2018-11-27  7:00 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, Linux ACPI, Linux MM, Greg KH,
	Rafael J. Wysocki, Dave Hansen

On Wed, Nov 14, 2018 at 2:53 PM Keith Busch <keith.busch@intel.com> wrote:
>
> Heterogeneous memory systems provide memory nodes with latency
> and bandwidth performance attributes that are different from other
> nodes. Create an interface for the kernel to register these attributes
> under the node that provides the memory. If the system provides this
> information, applications can query the node attributes when deciding
> from which node to request memory.
>
> When multiple memory initiators exist, accessing the same memory target
> from each may not perform the same as the other. The highest performing
> initiator to a given target is considered to be a local initiator for
> that target. The kernel provides performance attributes only for the
> local initiators.
>
> The memory's compute node should be symlinked in sysfs as one of the
> node's initiators.
>
> The following example shows the new sysfs hierarchy for a node exporting
> performance attributes:
>
>   # tree /sys/devices/system/node/nodeY/initiator_access
>   /sys/devices/system/node/nodeY/initiator_access
>   |-- read_bandwidth
>   |-- read_latency
>   |-- write_bandwidth
>   `-- write_latency

With the expectation that there will be nodes that are initiator-only,
target-only, or both, I think this interface should indicate that. The
1:1 "local" designation of HMAT should not be directly encoded in the
interface; it's just a shortcut for finding at least one initiator in
the set that can realize the advertised performance. At least if the
interface can enumerate the set of initiators then it becomes clear
whether sysfs can answer a performance enumeration question or if the
application needs to consult an interface with specific knowledge of a
given initiator-target pairing.

It seems a precursor to these patches is to arrange for offline node
devices to be created for the ACPI proximity domains that are offline
by default for reserved memory ranges.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/7] node: Add heterogenous memory performance
  2018-11-27  7:00   ` Dan Williams
@ 2018-11-27 17:42     ` Dan Williams
  2018-11-27 17:44     ` Keith Busch
  1 sibling, 0 replies; 43+ messages in thread
From: Dan Williams @ 2018-11-27 17:42 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux Kernel Mailing List, Linux ACPI, Linux MM, Greg KH,
	Rafael J. Wysocki, Dave Hansen

On Mon, Nov 26, 2018 at 11:00 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Nov 14, 2018 at 2:53 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > Heterogeneous memory systems provide memory nodes with latency
> > and bandwidth performance attributes that are different from other
> > nodes. Create an interface for the kernel to register these attributes
> > under the node that provides the memory. If the system provides this
> > information, applications can query the node attributes when deciding
> > from which node to request memory.
> >
> > When multiple memory initiators exist, accessing the same memory target
> > from each may not perform the same as the other. The highest performing
> > initiator to a given target is considered to be a local initiator for
> > that target. The kernel provides performance attributes only for the
> > local initiators.
> >
> > The memory's compute node should be symlinked in sysfs as one of the
> > node's initiators.
> >
> > The following example shows the new sysfs hierarchy for a node exporting
> > performance attributes:
> >
> >   # tree /sys/devices/system/node/nodeY/initiator_access
> >   /sys/devices/system/node/nodeY/initiator_access
> >   |-- read_bandwidth
> >   |-- read_latency
> >   |-- write_bandwidth
> >   `-- write_latency
>
> With the expectation that there will be nodes that are initiator-only,
> target-only, or both, I think this interface should indicate that. The
> 1:1 "local" designation of HMAT should not be directly encoded in the
> interface; it's just a shortcut for finding at least one initiator in
> the set that can realize the advertised performance. At least if the
> interface can enumerate the set of initiators then it becomes clear
> whether sysfs can answer a performance enumeration question or if the
> application needs to consult an interface with specific knowledge of a
> given initiator-target pairing.

Sorry, I misread patch1, this series does allow publishing the
multi-initiator case that shares the same performance profile to a
given target.

> It seems a precursor to these patches is to arrange for offline node
> devices to be created for the ACPI proximity domains that are offline
> by default for reserved memory ranges.

We likely still need this, though, because node devices don't tend to
show up until they have a cpu or online memory.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/7] node: Add heterogenous memory performance
  2018-11-27  7:00   ` Dan Williams
  2018-11-27 17:42     ` Dan Williams
@ 2018-11-27 17:44     ` Keith Busch
  1 sibling, 0 replies; 43+ messages in thread
From: Keith Busch @ 2018-11-27 17:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, Linux ACPI, Linux MM, Greg KH,
	Rafael J. Wysocki, Hansen, Dave

On Mon, Nov 26, 2018 at 11:00:09PM -0800, Dan Williams wrote:
> On Wed, Nov 14, 2018 at 2:53 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > Heterogeneous memory systems provide memory nodes with latency
> > and bandwidth performance attributes that are different from other
> > nodes. Create an interface for the kernel to register these attributes
> > under the node that provides the memory. If the system provides this
> > information, applications can query the node attributes when deciding
> > from which node to request memory.
> >
> > When multiple memory initiators exist, accessing the same memory target
> > from each may not perform the same as the other. The highest performing
> > initiator to a given target is considered to be a local initiator for
> > that target. The kernel provides performance attributes only for the
> > local initiators.
> >
> > The memory's compute node should be symlinked in sysfs as one of the
> > node's initiators.
> >
> > The following example shows the new sysfs hierarchy for a node exporting
> > performance attributes:
> >
> >   # tree /sys/devices/system/node/nodeY/initiator_access
> >   /sys/devices/system/node/nodeY/initiator_access
> >   |-- read_bandwidth
> >   |-- read_latency
> >   |-- write_bandwidth
> >   `-- write_latency
> 
> With the expectation that there will be nodes that are initiator-only,
> target-only, or both, I think this interface should indicate that. The
> 1:1 "local" designation of HMAT should not be directly encoded in the
> interface; it's just a shortcut for finding at least one initiator in
> the set that can realize the advertised performance. At least if the
> interface can enumerate the set of initiators then it becomes clear
> whether sysfs can answer a performance enumeration question or if the
> application needs to consult an interface with specific knowledge of a
> given initiator-target pairing.
> 
> It seems a precursor to these patches is to arrange for offline node
> devices to be created for the ACPI proximity domains that are offline
> by default for reserved memory ranges.

The intention is that all initiators symlinked to the memory node share
the initiator_access attributes, and the node is also its own
initiator. There's no limit to how many initiators the new kernel
interface in patch 1/7 allows you to register, so it's not really a 1:1
relationship.

Either instead of or in addition to the symlinks, we can export a
node_mask in the initiator_access directory for the nodes these access
attributes apply to, if that makes the intention clearer. E.g.:
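
  # cat /sys/devices/system/node/nodeY/initiator_access/node_mask
  0,2

(Hypothetical output; the mask would list the initiator nodes sharing
these attributes, including the target node itself.)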

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-11-16 18:32       ` Keith Busch
  2018-11-19  3:15         ` Anshuman Khandual
@ 2018-12-04 15:43         ` Aneesh Kumar K.V
  2018-12-04 16:54           ` Keith Busch
  1 sibling, 1 reply; 43+ messages in thread
From: Aneesh Kumar K.V @ 2018-12-04 15:43 UTC (permalink / raw)
  To: Keith Busch, Matthew Wilcox
  Cc: linux-kernel, linux-acpi, linux-mm, Greg Kroah-Hartman,
	Rafael Wysocki, Dave Hansen, Dan Williams

Keith Busch <keith.busch@intel.com> writes:

> On Thu, Nov 15, 2018 at 12:36:54PM -0800, Matthew Wilcox wrote:
>> On Thu, Nov 15, 2018 at 07:59:20AM -0700, Keith Busch wrote:
>> > On Thu, Nov 15, 2018 at 05:57:10AM -0800, Matthew Wilcox wrote:
>> > > On Wed, Nov 14, 2018 at 03:49:14PM -0700, Keith Busch wrote:
>> > > > Memory-only nodes will often have affinity to a compute node, and
>> > > > platforms have ways to express that locality relationship.
>> > > > 
>> > > > A node containing CPUs or other DMA devices that can initiate memory
>> > > > access is referred to as a "memory initiator". A "memory target" is a
>> > > > node that provides at least one physical address range accessible to a
>> > > > memory initiator.
>> > > 
>> > > I think I may be confused here.  If there is _no_ link from node X to
>> > > node Y, does that mean that node X's CPUs cannot access the memory on
>> > > node Y?  In my mind, all nodes can access all memory in the system,
>> > > just not with uniform bandwidth/latency.
>> > 
>> > The link is just about which nodes are "local". It's like how nodes have
>> > a cpulist. Other CPUs not in the node's list can access that node's memory,
>> > but the ones in the mask are local, and provide useful optimization hints.
>> 
>> So ... let's imagine a hypothetical system (I've never seen one built like
>> this, but it doesn't seem too implausible).  Connect four CPU sockets in
>> a square, each of which has some regular DIMMs attached to it.  CPU A is
>> 0 hops to Memory A, one hop to Memory B and Memory C, and two hops from
>> Memory D (each CPU only has two "QPI" links).  Then maybe there's some
>> special memory extender device attached on the PCIe bus.  Now there's
>> Memory B1 and B2 that's attached to CPU B and it's local to CPU B, but
>> not as local as Memory B is ... and we'd probably _prefer_ to allocate
>> memory for CPU A from Memory B1 than from Memory D.  But ... *mumble*,
>> this seems hard.
>
> Indeed, that particular example is out of scope for this series. The
> first objective is to aid a process running on node B's CPUs to allocate
> memory in B1. Anything that crosses QPI is on its own.

But if you can extrapolate how such a system could possibly be expressed
using what is proposed here, it would help in reviewing this. Also, how
do we intend to express the locality of memory w.r.t. other computing
units like GPU/FPGA?

I understand that this is looked at as ACPI HMAT in sysfs format.
But as mentioned by others in this thread, if we don't do this in a
platform- and device-independent way, won't we have application
portability issues going forward?

-aneesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/7] node: Link memory nodes to their compute nodes
  2018-12-04 15:43         ` Aneesh Kumar K.V
@ 2018-12-04 16:54           ` Keith Busch
  0 siblings, 0 replies; 43+ messages in thread
From: Keith Busch @ 2018-12-04 16:54 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Matthew Wilcox, linux-kernel, linux-acpi, linux-mm,
	Greg Kroah-Hartman, Rafael Wysocki, Dave Hansen, Dan Williams

On Tue, Dec 04, 2018 at 09:13:33PM +0530, Aneesh Kumar K.V wrote:
> Keith Busch <keith.busch@intel.com> writes:
> >
> > Indeed, that particular example is out of scope for this series. The
> > first objective is to aid a process running on node B's CPUs to allocate
> > memory in B1. Anything that crosses QPI is on its own.
> 
> But if you can extrapolate how such a system could possibly be expressed
> using what is proposed here, it would help in reviewing this.

Expressed to what end? This proposal is not trying to express anything
other than the best possible pairings because that is the most common
information applications will want to know.

> Also, how
> do we intend to express the locality of memory w.r.t. other computing
> units like GPU/FPGA?

The HMAT parsing at the end of the series provides an example of how
others may use the proposed interfaces.

> I understand that this is looked at as ACPI HMAT in sysfs format.
> But as mentioned by others in this thread, if we don't do this in a
> platform- and device-independent way, won't we have application
> portability issues going forward?

Only the last patch is specific to HMAT. If there are other ways to get
the same attributes, then those drivers or subsystems may also register
them with these new kernel interfaces.
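
As a sketch, a driver with a non-ACPI locality source might do the
following (the snippet is hypothetical; only the function call itself
comes from patch 1/7):

  int ret;

  /* mem_nid/cpu_nid come from some platform-specific description */
  ret = register_memory_node_under_compute_node(mem_nid, cpu_nid);
  if (ret)
          pr_warn("failed to link node %u to node %u\n",
                  mem_nid, cpu_nid);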

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/7] doc/vm: New documentation for memory performance
  2018-11-15 12:59   ` Jonathan Cameron
@ 2018-12-10 16:12     ` Dan Williams
  0 siblings, 0 replies; 43+ messages in thread
From: Dan Williams @ 2018-12-10 16:12 UTC (permalink / raw)
  To: jonathan.cameron
  Cc: Keith Busch, Linux Kernel Mailing List, Linux ACPI, Linux MM,
	Greg KH, Rafael J. Wysocki, Dave Hansen

On Thu, Nov 15, 2018 at 4:59 AM Jonathan Cameron
<jonathan.cameron@huawei.com> wrote:
>
> On Wed, 14 Nov 2018 15:49:16 -0700
> Keith Busch <keith.busch@intel.com> wrote:
[..]
> > +The kernel does not provide performance attributes for non-local memory
> > +initiators. The performance characteristics the kernel provides for
> > +the local initiators are exported as follows::
> > +
> > +     # tree /sys/devices/system/node/nodeY/initiator_access
> > +     /sys/devices/system/node/nodeY/initiator_access
> > +     |-- read_bandwidth
> > +     |-- read_latency
> > +     |-- write_bandwidth
> > +     `-- write_latency
> > +
> > +The bandwidth attributes are provided in MiB/second.
> > +
> > +The latency attributes are provided in nanoseconds.
> > +
> > +See also: https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>
> My worry here is we are explicitly making an interface that is only ever
> providing "local" node information, where local node is not the best
> defined thing in the world for complex topologies.
>
> I have no problem with that making a sensible starting point for providing
> information userspace knows what to do with, just with an interface that
> in and of itself doesn't make that clear.
>
> Perhaps something as simple as
> /sys/devices/system/node/nodeY/local_initiatorX
> /sys/devices/system/node/nodeX/local_targetY
>
> That leaves us the option of coming along later and having a full listing
> when a userspace requirement has become clear.  Another option would
> be an exhaustive list of all initiator / memory pairs that exist, with
> an additional sysfs file giving a list of those that are nearest
> to avoid every userspace program having to do the search.

I worry that "local" is an HMAT-specific concept when all it is in
actuality is a place for platform firmware to list the "best" or
"primary" access initiators. How about "initiator_classX", with some
documentation that the 0th class captures this primary initiator set?
That leaves the interface a straightforward way to add more classes in
the future, but with no strict association with the class number.
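
I.e., something like the following (hypothetical layout):

  # tree /sys/devices/system/node/nodeY/initiator_class0
  /sys/devices/system/node/nodeY/initiator_class0
  |-- read_bandwidth
  |-- read_latency
  |-- write_bandwidth
  `-- write_latency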

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2018-12-10 16:12 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-14 22:49 [PATCH 1/7] node: Link memory nodes to their compute nodes Keith Busch
2018-11-14 22:49 ` [PATCH 2/7] node: Add heterogenous memory performance Keith Busch
2018-11-19  3:35   ` Anshuman Khandual
2018-11-19 15:46     ` Keith Busch
2018-11-22 13:22       ` Anshuman Khandual
2018-11-27  7:00   ` Dan Williams
2018-11-27 17:42     ` Dan Williams
2018-11-27 17:44     ` Keith Busch
2018-11-14 22:49 ` [PATCH 3/7] doc/vm: New documentation for " Keith Busch
2018-11-15 12:59   ` Jonathan Cameron
2018-12-10 16:12     ` Dan Williams
2018-11-20 13:51   ` Mike Rapoport
2018-11-20 15:31     ` Keith Busch
2018-11-14 22:49 ` [PATCH 4/7] node: Add memory caching attributes Keith Busch
2018-11-15  0:40   ` Dave Hansen
2018-11-19  4:14   ` Anshuman Khandual
2018-11-19 23:06     ` Keith Busch
2018-11-22 13:29       ` Anshuman Khandual
2018-11-26 15:14         ` Keith Busch
2018-11-26 19:06   ` Greg Kroah-Hartman
2018-11-26 19:53     ` Keith Busch
2018-11-26 19:06   ` Greg Kroah-Hartman
2018-11-14 22:49 ` [PATCH 5/7] doc/vm: New documentation for memory cache Keith Busch
2018-11-15  0:41   ` Dave Hansen
2018-11-15 13:16   ` Jonathan Cameron
2018-11-20 13:53   ` Mike Rapoport
2018-11-14 22:49 ` [PATCH 6/7] acpi: Create subtable parsing infrastructure Keith Busch
2018-11-19  9:58   ` Rafael J. Wysocki
2018-11-19 18:36     ` Keith Busch
2018-11-14 22:49 ` [PATCH 7/7] acpi/hmat: Parse and report heterogeneous memory Keith Busch
2018-11-15 13:57 ` [PATCH 1/7] node: Link memory nodes to their compute nodes Matthew Wilcox
2018-11-15 14:59   ` Keith Busch
2018-11-15 17:50     ` Dan Williams
2018-11-19  3:04       ` Anshuman Khandual
2018-11-15 20:36     ` Matthew Wilcox
2018-11-16 18:32       ` Keith Busch
2018-11-19  3:15         ` Anshuman Khandual
2018-11-19 15:49           ` Keith Busch
2018-12-04 15:43         ` Aneesh Kumar K.V
2018-12-04 16:54           ` Keith Busch
2018-11-16 22:55       ` Dan Williams
2018-11-19  2:52     ` Anshuman Khandual
2018-11-19  2:46 ` Anshuman Khandual

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).