From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Keith Busch <keith.busch@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-api@vger.kernel.org,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Rafael Wysocki <rafael@kernel.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Dan Williams <dan.j.williams@intel.com>
Subject: Re: [PATCHv7 10/10] doc/mm: New documentation for memory performance
Date: Mon, 11 Mar 2019 11:38:43 +0000
Message-ID: <20190311113843.00006b47@huawei.com>
In-Reply-To: <20190227225038.20438-11-keith.busch@intel.com>

On Wed, 27 Feb 2019 15:50:38 -0700
Keith Busch <keith.busch@intel.com> wrote:

> Platforms may provide system memory where some physical address ranges
> perform differently than others, or is side cached by the system.
The magic 'side cached' term is still here in the patch description; ideally
it wants cleaning up.

>
> Add documentation describing a high level overview of such systems and the
> perforamnce and caching attributes the kernel provides for applications
performance

> wishing to query this information.
>
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Keith Busch <keith.busch@intel.com>

A few comments inline. Mostly the weird corner cases that I misunderstood
in one of the earlier versions of the code.

Whilst I think perhaps that one section could be tweaked a tiny bit, I'm
basically happy with this if you don't want to.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  Documentation/admin-guide/mm/numaperf.rst | 164 ++++++++++++++++++++++++++++++
>  1 file changed, 164 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>
> diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
> new file mode 100644
> index 000000000000..d32756b9be48
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/numaperf.rst
> @@ -0,0 +1,164 @@
> +.. _numaperf:
> +
> +=============
> +NUMA Locality
> +=============
> +
> +Some platforms may have multiple types of memory attached to a compute
> +node. These disparate memory ranges may share some characteristics, such
> +as CPU cache coherence, but may have different performance. For example,
> +different media types and buses affect bandwidth and latency.
> +
> +A system supports such heterogeneous memory by grouping each memory type
> +under different domains, or "nodes", based on locality and performance
> +characteristics.  Some memory may share the same node as a CPU, and others
> +are provided as memory only nodes. While memory only nodes do not provide
> +CPUs, they may still be local to one or more compute nodes relative to
> +other nodes. The following diagram shows one such example of two compute
> +nodes with local memory and a memory only node for each of compute node:
> +
> + +------------------+     +------------------+
> + | Compute Node 0   +-----+ Compute Node 1   |
> + | Local Node0 Mem  |     | Local Node1 Mem  |
> + +--------+---------+     +--------+---------+
> +          |                        |
> + +--------+---------+     +--------+---------+
> + | Slower Node2 Mem |     | Slower Node3 Mem |
> + +------------------+     +--------+---------+
> +
> +A "memory initiator" is a node containing one or more devices such as
> +CPUs or separate memory I/O devices that can initiate memory requests.
> +A "memory target" is a node containing one or more physical address
> +ranges accessible from one or more memory initiators.
> +
> +When multiple memory initiators exist, they may not all have the same
> +performance when accessing a given memory target. Each initiator-target
> +pair may be organized into different ranked access classes to represent
> +this relationship.

This concept is a bit vague at the moment, largely because only access0
is actually defined. We should definitely keep a close eye on any others
that are defined in future to make sure this text is still valid.

I can certainly see it being used for different ideas of 'best' rather
than simply best and second best etc.

> The highest performing initiator to a given target
> +is considered to be one of that target's local initiators, and given
> +the highest access class, 0. Any given target may have one or more
> +local initiators, and any given initiator may have multiple local
> +memory targets.
> +
> +To aid applications matching memory targets with their initiators, the
> +kernel provides symlinks to each other. The following example lists the
> +relationship for the access class "0" memory initiators and targets, which is
> +the of nodes with the highest performing access relationship::
> +
> +	# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
> +	relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

So this one perhaps needs a bit more description - I would put it after
initiators, which precisely fits the description you have here now.

"targets contains those nodes for which this initiator is the best possible
initiator."

which is subtly different from

"targets contains those nodes to which this node has the highest performing
access characteristics."

For example in my test case:
* 4 nodes with local memory and cpu, 1 node remote and equidistant from all of
  the initiators,

targets for the compute nodes contains both themselves and the remote node, to
which the characteristics are of course worse. As you point out before, we need
to look in
node0/access0/targets/node0/access0/initiators
node0/access0/targets/node4/access0/initiators
to get the relevant characteristics and work out that node0 is 'nearer' itself
(obviously this is a bit of a silly case, but we could have no memory node0 and
be talking about node4 and node5).

I am happy with the actual interface, this is just a question about whether we
can tweak this text to be slightly clearer.

> +
> +	# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
> +	relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
> +
> +================
> +NUMA Performance
> +================
> +
> +Applications may wish to consider which node they want their memory to
> +be allocated from based on the node's performance characteristics. If
> +the system provides these attributes, the kernel exports them under the
> +node sysfs hierarchy by appending the attributes directory under the
> +memory node's access class 0 initiators as follows::
> +
> +	/sys/devices/system/node/nodeY/access0/initiators/
> +
> +These attributes apply only when accessed from nodes that have the
> +are linked under the this access's inititiators.
> +
> +The performance characteristics the kernel provides for the local initiators
> +are exported are as follows::
> +
> +	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
> +	/sys/devices/system/node/nodeY/access0/initiators/
> +	|-- read_bandwidth
> +	|-- read_latency
> +	|-- write_bandwidth
> +	`-- write_latency
> +
> +The bandwidth attributes are provided in MiB/second.
> +
> +The latency attributes are provided in nanoseconds.
> +
> +The values reported here correspond to the rated latency and bandwidth
> +for the platform.
> +
> +==========
> +NUMA Cache
> +==========
> +
> +System memory may be constructed in a hierarchy of elements with various
> +performance characteristics in order to provide large address space of
> +slower performing memory cached by a smaller higher performing memory. The
> +system physical addresses memory initiators are aware of are provided
> +by the last memory level in the hierarchy. The system meanwhile uses
> +higher performing memory to transparently cache access to progressively
> +slower levels.
> +
> +The term "far memory" is used to denote the last level memory in the
> +hierarchy. Each increasing cache level provides higher performing
> +initiator access, and the term "near memory" represents the fastest
> +cache provided by the system.
> +
> +This numbering is different than CPU caches where the cache level (ex:
> +L1, L2, L3) uses the CPU-side view where each increased level is lower
> +performing. In contrast, the memory cache level is centric to the last
> +level memory, so the higher numbered cache level corresponds to memory
> +nearer to the CPU, and further from far memory.
> +
> +The memory-side caches are not directly addressable by software. When
> +software accesses a system address, the system will return it from the
> +near memory cache if it is present. If it is not present, the system
> +accesses the next level of memory until there is either a hit in that
> +cache level, or it reaches far memory.
> +
> +An application does not need to know about caching attributes in order
> +to use the system. Software may optionally query the memory cache
> +attributes in order to maximize the performance out of such a setup.
> +If the system provides a way for the kernel to discover this information,
> +for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
> +the kernel will append these attributes to the NUMA node memory target.
> +
> +When the kernel first registers a memory cache with a node, the kernel
> +will create the following directory::

Real nitpick, but more precisely: "If relevant, the kernel..."
Otherwise we say it's always there but then say it isn't below.

> +
> +	/sys/devices/system/node/nodeX/memory_side_cache/
> +
> +If that directory is not present, the system either does not not provide
> +a memory-side cache, or that information is not accessible to the kernel.
> +
> +The attributes for each level of cache is provided under its cache
> +level index::
> +
> +	/sys/devices/system/node/nodeX/memory_side_cache/indexA/
> +	/sys/devices/system/node/nodeX/memory_side_cache/indexB/
> +	/sys/devices/system/node/nodeX/memory_side_cache/indexC/
> +
> +Each cache level's directory provides its attributes. For example, the
> +following shows a single cache level and the attributes available for
> +software to query::
> +
> +	# tree sys/devices/system/node/node0/memory_side_cache/
> +	/sys/devices/system/node/node0/memory_side_cache/
> +	|-- index1
> +	|   |-- indexing
> +	|   |-- line_size
> +	|   |-- size
> +	|   `-- write_policy
> +
> +The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
> +for any other indexed based, multi-way associativity.
> +
> +The "line_size" is the number of bytes accessed from the next cache
> +level on a miss.
> +
> +The "size" is the number of bytes provided by this cache level.
> +
> +The "write_policy" will be 0 for write-back, and non-zero for
> +write-through caching.
> +
> +========
> +See Also
> +========
> +.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> +       Section 5.2.27
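
To make the targets/initiators distinction raised above concrete, here is a
rough userspace sketch. It assumes the sysfs layout quoted in the patch
(nodeY/access0/initiators/ holding nodeX links plus the four read/write
attributes, each a single integer). The function names read_access0 and
targets_of, and the root parameter, are hypothetical and exist only for this
illustration; nothing here is part of the series.

```python
import os
import re

NODE_RE = re.compile(r"^node(\d+)$")
# Attribute names as quoted in the patch; bandwidth in MiB/s, latency in ns.
PERF_ATTRS = ("read_bandwidth", "read_latency",
              "write_bandwidth", "write_latency")


def read_access0(root="/sys/devices/system/node"):
    """Return {target: {"initiators": [ids], "perf": {attr: value}}}.

    Nodes without an access0/initiators directory are skipped, since the
    attributes are only exported when the platform provides them.
    """
    result = {}
    if not os.path.isdir(root):
        return result
    for entry in sorted(os.listdir(root)):
        node = NODE_RE.match(entry)
        if node is None:
            continue
        init_dir = os.path.join(root, entry, "access0", "initiators")
        if not os.path.isdir(init_dir):
            continue
        # The nodeX entries are the best (access class 0) initiators.
        initiators = []
        for name in os.listdir(init_dir):
            link = NODE_RE.match(name)
            if link is not None:
                initiators.append(int(link.group(1)))
        # Assumption: each attribute file holds one decimal integer.
        perf = {}
        for attr in PERF_ATTRS:
            path = os.path.join(init_dir, attr)
            if os.path.isfile(path):
                with open(path) as f:
                    perf[attr] = int(f.read().strip())
        result[int(node.group(1))] = {
            "initiators": sorted(initiators),
            "perf": perf,
        }
    return result


def targets_of(initiator, access0):
    """Targets for which `initiator` is one of the best initiators."""
    return sorted(target for target, info in access0.items()
                  if initiator in info["initiators"])
```

With the test case described above (four compute nodes with local memory and
one equidistant memory-only node4), targets_of(0, ...) would list both node0
and node4, so a caller still has to compare the "perf" entries of each target
to see that node0 is nearer to itself.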