From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 20 Dec 2017 14:13:50 -0700
From: Ross Zwisler
To: Matthew Wilcox
Cc: Ross Zwisler, Michal Hocko, linux-kernel@vger.kernel.org,
	"Anaczkowski, Lukasz", "Box, David E", "Kogut, Jaroslaw",
	"Koss, Marcin", "Koziej, Artur", "Lahtinen, Joonas",
	"Moore, Robert", "Nachimuthu, Murugasamy", "Odzioba, Lukasz",
	"Rafael J. Wysocki", "Schmauss, Erik", "Verma, Vishal L",
	"Zheng, Lv", Andrew Morton, Balbir Singh, Brice Goglin,
	Dan Williams, Dave Hansen, Jerome Glisse, John Hubbard, Len Brown,
	Tim Chen, devel@acpica.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, linux-nvdimm@lists.01.org,
	linux-api@vger.kernel.org, linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Message-ID: <20171220211350.GA2688@linux.intel.com>
References: <20171214021019.13579-1-ross.zwisler@linux.intel.com>
	<20171214130032.GK16951@dhcp22.suse.cz>
	<20171218203547.GA2366@linux.intel.com>
	<20171220181937.GB12236@bombadil.infradead.org>
In-Reply-To: <20171220181937.GB12236@bombadil.infradead.org>

On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> > What I'm hoping to do with this series is to just provide a sysfs
> > representation of the HMAT so that applications can know which NUMA nodes to
> > select with existing utilities like numactl.  This series does not currently
> > alter any kernel behavior, it only provides a sysfs interface.
> >
> > Say for example you had a system with some high bandwidth memory (HBM), and
> > you wanted to use it for a specific application.  You could use the sysfs
> > representation of the HMAT to figure out which memory target held your HBM.
> > You could do this by looking at the local bandwidth values for the various
> > memory targets, so:
> >
> > 	# grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
> > 	/sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> >
> > and look for the one that corresponds to your HBM speed. (These numbers are
> > made up, but you get the idea.)
>
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future.  I think
> we need a platform-agnostic way ... right, PowerPC people?

Hey Matthew,

Yep, this is where I started as well.  My plan with my initial implementation
was to try to make the sysfs representation as platform-agnostic as possible,
and just have the ACPI HMAT as one of the many places to gather the data
needed to populate sysfs.

However, as I began coding, the implementation became very specific to the
HMAT, probably because I don't know of a way that this type of info is
represented on another platform.  John Hubbard noticed the same thing and
asked me to s/HMEM/HMAT/ everywhere, both to make it explicitly HMAT-specific
and to keep it from being confused with the HMM work:

https://lkml.org/lkml/2017/7/7/33
https://lkml.org/lkml/2017/7/7/442

I'm open to making it more platform-agnostic if I can get my hands on a
parallel effort on another platform and tease out the commonality, but trying
to do that without a second example hasn't worked out.
If we don't have a good second example right now, I think maybe we should put
this in and then merge it with the second example when it comes along.

> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).
>
> > Once you know the NUMA node of your HBM, you can figure out the NUMA node of
> > its local initiator:
> >
> > 	# ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> >
> > So, in our made-up example our HBM is located in NUMA node 2, and the local
> > CPU for that HBM is at NUMA node 0.
>
> initiator is a CPU?  I'd have expected you to expose a memory controller
> abstraction rather than re-use storage terminology.

Yeah, I agree that at first blush it seems weird.  It turns out that looking
at it in sort of a storage initiator/target way is beneficial, though, because
it allows us to cut down on the number of data values we need to represent.

For example, the SLIT, which doesn't differentiate between initiator and
target proximity domains (and thus nodes), always represents a system with N
proximity domains using an NxN distance table.  This makes sense if every node
contains both CPUs and memory.

With the introduction of the HMAT, though, we can have memory-only target
nodes and we can explicitly associate them with their local CPU.  This is
necessary so that we can separate memory with different performance
characteristics (HBM vs normal memory vs persistent memory, for example) that
are all attached to the same CPU.

So, say we now have a system with 4 CPUs, and each of those CPUs has 3
different types of memory attached to it.  We now have 16 total proximity
domains, 4 CPU and 12 memory.  If we represent this with the SLIT we end up
with a 16 x 16 distance table (256 entries), most of which don't matter
because they are memory-to-memory distances which don't make sense.

In the HMAT, though, we separate out the initiators and the targets and put
them into separate lists.  (See 5.2.27.4 System Locality Latency and Bandwidth
Information Structure in ACPI 6.2 for details.)  So, this same config in the
HMAT only has 4*12=48 performance values of each type, all of which convey
meaningful information.

The HMAT indeed even uses the storage "initiator" and "target" terminology. :)
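
To make the counting above concrete, here is a rough Python sketch of the two
table sizes for the 4-CPU, 3-memory-types-per-CPU example.  The function names
are made up for illustration and do not come from the patch series or from the
ACPI spec; the only inputs are the numbers already used in this mail.

```python
def slit_entries(num_domains: int) -> int:
    """SLIT-style table: one distance per ordered pair of proximity domains."""
    return num_domains * num_domains


def hmat_entries(num_initiators: int, num_targets: int) -> int:
    """HMAT-style table: one value per (initiator, target) pair, per data type."""
    return num_initiators * num_targets


# Made-up example from this thread: 4 CPU-only domains (initiators), each
# with 3 different kinds of memory attached (targets).
cpus = 4
mem_types_per_cpu = 3
targets = cpus * mem_types_per_cpu   # 12 memory-only domains
domains = cpus + targets             # 16 proximity domains in total

print(slit_entries(domains))         # 256
print(hmat_entries(cpus, targets))   # 48
```

The gap grows as more memory-only domains are added, since the SLIT-style
table scales with the square of all domains while the HMAT-style table scales
with initiators times targets.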
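
For the sysfs workflow quoted earlier in this thread, a small Python sketch of
how an application might pick the HBM target from the `grep .` output.  It
assumes the directory layout proposed in this series (mem_tgtN/local_init/
write_bw_MBps) and reuses the made-up numbers from the example; the helper
names are hypothetical, and it parses captured text rather than reading sysfs
so that it runs anywhere.

```python
import re

# Output of the `grep .` command from the made-up example in this thread.
SAMPLE = """\
/sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
/sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
/sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
/sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
"""

# Assumed layout: .../mem_tgt<N>/local_init/write_bw_MBps:<value>
LINE_RE = re.compile(r"/mem_tgt(\d+)/local_init/write_bw_MBps:(\d+)$")


def parse_write_bw(text):
    """Map memory target number -> local write bandwidth in MB/s."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.search(line.strip())
        if match:
            result[int(match.group(1))] = int(match.group(2))
    return result


def fastest_target(bw_by_target):
    """Return the target number with the highest write bandwidth."""
    return max(bw_by_target, key=bw_by_target.get)


bw = parse_write_bw(SAMPLE)
hbm_node = fastest_target(bw)
print(hbm_node, bw[hbm_node])  # 2 81920
```

With target 2 identified as the HBM and node 0 as its local initiator, the
application could then be placed with something along the lines of
`numactl --cpunodebind=0 --membind=2 <app>`, where `<app>` stands in for
whatever program you want to run.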