From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-nvdimm-bounces@lists.01.org>
Received: from mga01.intel.com (mga01.intel.com [192.55.52.88])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by ml01.01.org (Postfix) with ESMTPS id A32392219BCBC
 for <linux-nvdimm@lists.01.org>; Wed, 20 Dec 2017 13:09:04 -0800 (PST)
Date: Wed, 20 Dec 2017 14:13:50 -0700
From: Ross Zwisler <ross.zwisler@linux.intel.com>
Subject: Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Message-ID: <20171220211350.GA2688@linux.intel.com>
References: <20171214021019.13579-1-ross.zwisler@linux.intel.com>
 <20171214130032.GK16951@dhcp22.suse.cz>
 <20171218203547.GA2366@linux.intel.com>
 <20171220181937.GB12236@bombadil.infradead.org>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20171220181937.GB12236@bombadil.infradead.org>
List-Unsubscribe: <https://lists.01.org/mailman/options/linux-nvdimm>,
 <mailto:linux-nvdimm-request@lists.01.org?subject=unsubscribe>
List-Archive: <http://lists.01.org/pipermail/linux-nvdimm/>
List-Post: <mailto:linux-nvdimm@lists.01.org>
List-Help: <mailto:linux-nvdimm-request@lists.01.org?subject=help>
List-Subscribe: <https://lists.01.org/mailman/listinfo/linux-nvdimm>,
 <mailto:linux-nvdimm-request@lists.01.org?subject=subscribe>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: linux-nvdimm-bounces@lists.01.org
Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>, "Box, David E" <david.e.box@intel.com>, Dave Hansen <dave.hansen@intel.com>, "Zheng, Lv" <lv.zheng@intel.com>, linux-nvdimm@lists.01.org, "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>, "Anaczkowski, Lukasz" <lukasz.anaczkowski@intel.com>, Moore,Lukasz, "Erik  <erik.schmauss@intel.com>, Len Brown" <lenb@kernel.org>, John Hubbard <jhubbard@nvidia.com>, linuxppc-dev@lists.ozlabs.org, Jerome Glisse <jglisse@redhat.com>, devel@acpica.org, "Kogut, Jaroslaw" <Jaroslaw.Kogut@intel.com>, linux-mm@kvack.org, "Koss, Marcin" <marcin.koss@intel.com>, linux-api@vger.kernel.org, Brice Goglin <brice.goglin@gmail.com>, "Nachimuthu, Murugasamy  <murugasamy.nachimuthu@intel.com>, Rafael J. Wysocki" <rjw@rjwysocki.net>, linux-kernel@vger.kernel.org, Koziej,, "Joonas  <joonas.lahtinen@intel.com>, Andrew Morton <akpm@linux-foundation.org>, Tim Chen" <tim.c.chen@linux.intel.com>
List-ID: <linux-nvdimm@lists.01.org>

On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> > What I'm hoping to do with this series is to just provide a sysfs
> > representation of the HMAT so that applications can know which NUMA nodes to
> > select with existing utilities like numactl.  This series does not currently
> > alter any kernel behavior, it only provides a sysfs interface.
> > 
> > Say for example you had a system with some high bandwidth memory (HBM), and
> > you wanted to use it for a specific application.  You could use the sysfs
> > representation of the HMAT to figure out which memory target held your HBM.
> > You could do this by looking at the local bandwidth values for the various
> > memory targets, so:
> > 
> > 	# grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
> > 	/sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> > 
> > and look for the one that corresponds to your HBM speed. (These numbers are
> > made up, but you get the idea.)
> 
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future.  I think
> we need a platform-agnostic way ... right, PowerPC people?

Hey Matthew,

Yep, this is where I started as well.  My plan with my initial implementation
was to try and make the sysfs representation as platform agnostic as possible,
and just have the ACPI HMAT as one of the many places to gather the data
needed to populate sysfs.

However, as I began coding the implementation became very specific to the
HMAT, probably because I don't know of way that this type of info is
represented on another platform.  John Hubbard noticed the same thing and
asked me to s/HMEM/HMAT/ everywhere and just make it HMAT specific, and to
prevent it from being confused with the HMM work:

https://lkml.org/lkml/2017/7/7/33
https://lkml.org/lkml/2017/7/7/442

I'm open to making it more platform agnostic if I can get my hands on a
parallel effort in another platform and tease out the commonality, but trying
to do that without a second example hasn't worked out.  If we don't have a
good second example right now I think maybe we should put this in and then
merge it with the second example when it comes along.

> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).
> 
> > Once you know the NUMA node of your HBM, you can figure out the NUMA node of
> > it's local initiator:
> > 
> > 	# ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> > 
> > So, in our made-up example our HBM is located in numa node 2, and the local
> > CPU for that HBM is at numa node 0.
> 
> initiator is a CPU?  I'd have expected you to expose a memory controller
> abstraction rather than re-use storage terminology.

Yea, I agree that at first blush it seems weird.  It turns out that looking at
it in sort of a storage initiator/target way is beneficial, though, because it
allows us to cut down on the number of data values we need to represent.

For example the SLIT, which doesn't differentiate between initiator and target
proximity domains (and thus nodes) always represents a system with N proximity
domains using a NxN distance table.  This makes sense if every node contains
both CPUs and memory.

With the introduction of the HMAT, though, we can have memory-only initiator
nodes and we can explicitly associate them with their local CPU.  This is
necessary so that we can separate memory with different performance
characteristics (HBM vs normal memory vs persistent memory, for example) that
are all attached to the same CPU.

So, say we now have a system with 4 CPUs, and each of those CPUs has 3
different types of memory attached to it.  We now have 16 total proximity
domains, 4 CPU and 12 memory.

If we represent this with the SLIT we end up with a 16 X 16 distance table
(256 entries), most of which don't matter because they are memory-to-memory
distances which don't make sense.

In the HMAT, though, we separate out the initiators and the targets and put
them into separate lists.  (See 5.2.27.4 System Locality Latency and Bandwidth
Information Structure in ACPI 6.2 for details.)  So, this same config in the
HMAT only has 4*12=48 performance values of each type, all of which convey
meaningful information.

The HMAT indeed even uses the storage "initiator" and "target" terminology. :)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ross Zwisler <ross.zwisler@linux.intel.com>
Subject: Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Date: Wed, 20 Dec 2017 14:13:50 -0700
Message-ID: <20171220211350.GA2688@linux.intel.com>
References: <20171214021019.13579-1-ross.zwisler@linux.intel.com>
 <20171214130032.GK16951@dhcp22.suse.cz>
 <20171218203547.GA2366@linux.intel.com>
 <20171220181937.GB12236@bombadil.infradead.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path: <linux-acpi-owner@vger.kernel.org>
Received: from mga09.intel.com ([134.134.136.24]:63627 "EHLO mga09.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1755763AbdLTVNw (ORCPT <rfc822;linux-acpi@vger.kernel.org>);
        Wed, 20 Dec 2017 16:13:52 -0500
Content-Disposition: inline
In-Reply-To: <20171220181937.GB12236@bombadil.infradead.org>
Sender: linux-acpi-owner@vger.kernel.org
List-Id: linux-acpi@vger.kernel.org
To: Matthew Wilcox <willy@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>, Michal Hocko <mhocko@kernel.org>, linux-kernel@vger.kernel.org, "Anaczkowski, Lukasz" <lukasz.anaczkowski@intel.com>, "Box, David E" <david.e.box@intel.com>, "Kogut, Jaroslaw" <Jaroslaw.Kogut@intel.com>, "Koss, Marcin" <marcin.koss@intel.com>, "Koziej, Artur" <artur.koziej@intel.com>, "Lahtinen, Joonas" <joonas.lahtinen@intel.com>, "Moore, Robert" <robert.moore@intel.com>, "Nachimuthu, Murugasamy" <murugasamy.nachimuthu@intel.com>, "Odzioba, Lukasz" <lukasz.odzioba@intel.com>, "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>, "Rafael J. Wysocki" <rjw@rjwysocki.net>, "Schmauss, Erik" <erik.schmauss@intel.com>, "Verma, Vishal L" <vishal.l.verma@intel.com>, "Zheng, Lv" <lv.zheng@intel.com>, Andrew Morton <akpm@linux->

On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> > What I'm hoping to do with this series is to just provide a sysfs
> > representation of the HMAT so that applications can know which NUMA nodes to
> > select with existing utilities like numactl.  This series does not currently
> > alter any kernel behavior, it only provides a sysfs interface.
> > 
> > Say for example you had a system with some high bandwidth memory (HBM), and
> > you wanted to use it for a specific application.  You could use the sysfs
> > representation of the HMAT to figure out which memory target held your HBM.
> > You could do this by looking at the local bandwidth values for the various
> > memory targets, so:
> > 
> > 	# grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
> > 	/sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> > 
> > and look for the one that corresponds to your HBM speed. (These numbers are
> > made up, but you get the idea.)
> 
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future.  I think
> we need a platform-agnostic way ... right, PowerPC people?

Hey Matthew,

Yep, this is where I started as well.  My plan with my initial implementation
was to try and make the sysfs representation as platform agnostic as possible,
and just have the ACPI HMAT as one of the many places to gather the data
needed to populate sysfs.

However, as I began coding the implementation became very specific to the
HMAT, probably because I don't know of way that this type of info is
represented on another platform.  John Hubbard noticed the same thing and
asked me to s/HMEM/HMAT/ everywhere and just make it HMAT specific, and to
prevent it from being confused with the HMM work:

https://lkml.org/lkml/2017/7/7/33
https://lkml.org/lkml/2017/7/7/442

I'm open to making it more platform agnostic if I can get my hands on a
parallel effort in another platform and tease out the commonality, but trying
to do that without a second example hasn't worked out.  If we don't have a
good second example right now I think maybe we should put this in and then
merge it with the second example when it comes along.

> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).
> 
> > Once you know the NUMA node of your HBM, you can figure out the NUMA node of
> > it's local initiator:
> > 
> > 	# ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> > 
> > So, in our made-up example our HBM is located in numa node 2, and the local
> > CPU for that HBM is at numa node 0.
> 
> initiator is a CPU?  I'd have expected you to expose a memory controller
> abstraction rather than re-use storage terminology.

Yea, I agree that at first blush it seems weird.  It turns out that looking at
it in sort of a storage initiator/target way is beneficial, though, because it
allows us to cut down on the number of data values we need to represent.

For example the SLIT, which doesn't differentiate between initiator and target
proximity domains (and thus nodes) always represents a system with N proximity
domains using a NxN distance table.  This makes sense if every node contains
both CPUs and memory.

With the introduction of the HMAT, though, we can have memory-only initiator
nodes and we can explicitly associate them with their local CPU.  This is
necessary so that we can separate memory with different performance
characteristics (HBM vs normal memory vs persistent memory, for example) that
are all attached to the same CPU.

So, say we now have a system with 4 CPUs, and each of those CPUs has 3
different types of memory attached to it.  We now have 16 total proximity
domains, 4 CPU and 12 memory.

If we represent this with the SLIT we end up with a 16 X 16 distance table
(256 entries), most of which don't matter because they are memory-to-memory
distances which don't make sense.

In the HMAT, though, we separate out the initiators and the targets and put
them into separate lists.  (See 5.2.27.4 System Locality Latency and Bandwidth
Information Structure in ACPI 6.2 for details.)  So, this same config in the
HMAT only has 4*12=48 performance values of each type, all of which convey
meaningful information.

The HMAT indeed even uses the storage "initiator" and "target" terminology. :)

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1756577AbdLTVN4 (ORCPT <rfc822;w@1wt.eu>);
        Wed, 20 Dec 2017 16:13:56 -0500
Received: from mga09.intel.com ([134.134.136.24]:63627 "EHLO mga09.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1755763AbdLTVNw (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Dec 2017 16:13:52 -0500
X-Amp-Result: UNSCANNABLE
X-Amp-File-Uploaded: False
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.45,433,1508828400"; 
   d="scan'208";a="160516302"
Date: Wed, 20 Dec 2017 14:13:50 -0700
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
        Michal Hocko <mhocko@kernel.org>, linux-kernel@vger.kernel.org,
        "Anaczkowski, Lukasz" <lukasz.anaczkowski@intel.com>,
        "Box, David E" <david.e.box@intel.com>,
        "Kogut, Jaroslaw" <Jaroslaw.Kogut@intel.com>,
        "Koss, Marcin" <marcin.koss@intel.com>,
        "Koziej, Artur" <artur.koziej@intel.com>,
        "Lahtinen, Joonas" <joonas.lahtinen@intel.com>,
        "Moore, Robert" <robert.moore@intel.com>,
        "Nachimuthu, Murugasamy" <murugasamy.nachimuthu@intel.com>,
        "Odzioba, Lukasz" <lukasz.odzioba@intel.com>,
        "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
        "Rafael J. Wysocki" <rjw@rjwysocki.net>,
        "Schmauss, Erik" <erik.schmauss@intel.com>,
        "Verma, Vishal L" <vishal.l.verma@intel.com>,
        "Zheng, Lv" <lv.zheng@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Balbir Singh <bsingharora@gmail.com>,
        Brice Goglin <brice.goglin@gmail.com>,
        Dan Williams <dan.j.williams@intel.com>,
        Dave Hansen <dave.hansen@intel.com>,
        Jerome Glisse <jglisse@redhat.com>, John Hubbard <jhubbard@nvidia.com>,
        Len Brown <lenb@kernel.org>, Tim Chen <tim.c.chen@linux.intel.com>,
        devel@acpica.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org,
        linux-nvdimm@lists.01.org, linux-api@vger.kernel.org,
        linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Message-ID: <20171220211350.GA2688@linux.intel.com>
References: <20171214021019.13579-1-ross.zwisler@linux.intel.com>
 <20171214130032.GK16951@dhcp22.suse.cz>
 <20171218203547.GA2366@linux.intel.com>
 <20171220181937.GB12236@bombadil.infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20171220181937.GB12236@bombadil.infradead.org>
User-Agent: Mutt/1.9.1 (2017-09-22)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> > What I'm hoping to do with this series is to just provide a sysfs
> > representation of the HMAT so that applications can know which NUMA nodes to
> > select with existing utilities like numactl.  This series does not currently
> > alter any kernel behavior, it only provides a sysfs interface.
> > 
> > Say for example you had a system with some high bandwidth memory (HBM), and
> > you wanted to use it for a specific application.  You could use the sysfs
> > representation of the HMAT to figure out which memory target held your HBM.
> > You could do this by looking at the local bandwidth values for the various
> > memory targets, so:
> > 
> > 	# grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
> > 	/sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> > 
> > and look for the one that corresponds to your HBM speed. (These numbers are
> > made up, but you get the idea.)
> 
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future.  I think
> we need a platform-agnostic way ... right, PowerPC people?

Hey Matthew,

Yep, this is where I started as well.  My plan with my initial implementation
was to try and make the sysfs representation as platform agnostic as possible,
and just have the ACPI HMAT as one of the many places to gather the data
needed to populate sysfs.

However, as I began coding the implementation became very specific to the
HMAT, probably because I don't know of way that this type of info is
represented on another platform.  John Hubbard noticed the same thing and
asked me to s/HMEM/HMAT/ everywhere and just make it HMAT specific, and to
prevent it from being confused with the HMM work:

https://lkml.org/lkml/2017/7/7/33
https://lkml.org/lkml/2017/7/7/442

I'm open to making it more platform agnostic if I can get my hands on a
parallel effort in another platform and tease out the commonality, but trying
to do that without a second example hasn't worked out.  If we don't have a
good second example right now I think maybe we should put this in and then
merge it with the second example when it comes along.

> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).
> 
> > Once you know the NUMA node of your HBM, you can figure out the NUMA node of
> > it's local initiator:
> > 
> > 	# ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> > 
> > So, in our made-up example our HBM is located in numa node 2, and the local
> > CPU for that HBM is at numa node 0.
> 
> initiator is a CPU?  I'd have expected you to expose a memory controller
> abstraction rather than re-use storage terminology.

Yea, I agree that at first blush it seems weird.  It turns out that looking at
it in sort of a storage initiator/target way is beneficial, though, because it
allows us to cut down on the number of data values we need to represent.

For example the SLIT, which doesn't differentiate between initiator and target
proximity domains (and thus nodes) always represents a system with N proximity
domains using a NxN distance table.  This makes sense if every node contains
both CPUs and memory.

With the introduction of the HMAT, though, we can have memory-only initiator
nodes and we can explicitly associate them with their local CPU.  This is
necessary so that we can separate memory with different performance
characteristics (HBM vs normal memory vs persistent memory, for example) that
are all attached to the same CPU.

So, say we now have a system with 4 CPUs, and each of those CPUs has 3
different types of memory attached to it.  We now have 16 total proximity
domains, 4 CPU and 12 memory.

If we represent this with the SLIT we end up with a 16 X 16 distance table
(256 entries), most of which don't matter because they are memory-to-memory
distances which don't make sense.

In the HMAT, though, we separate out the initiators and the targets and put
them into separate lists.  (See 5.2.27.4 System Locality Latency and Bandwidth
Information Structure in ACPI 6.2 for details.)  So, this same config in the
HMAT only has 4*12=48 performance values of each type, all of which convey
meaningful information.

The HMAT indeed even uses the storage "initiator" and "target" terminology. :)

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail-pl0-f71.google.com (mail-pl0-f71.google.com [209.85.160.71])
	by kanga.kvack.org (Postfix) with ESMTP id E332D6B0038
	for <linux-mm@kvack.org>; Wed, 20 Dec 2017 16:13:54 -0500 (EST)
Received: by mail-pl0-f71.google.com with SMTP id x1so10168202plb.2
        for <linux-mm@kvack.org>; Wed, 20 Dec 2017 13:13:54 -0800 (PST)
Received: from mga05.intel.com (mga05.intel.com. [192.55.52.43])
        by mx.google.com with ESMTPS id h5si12483353pgv.48.2017.12.20.13.13.53
        for <linux-mm@kvack.org>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Wed, 20 Dec 2017 13:13:53 -0800 (PST)
Date: Wed, 20 Dec 2017 14:13:50 -0700
From: Ross Zwisler <ross.zwisler@linux.intel.com>
Subject: Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Message-ID: <20171220211350.GA2688@linux.intel.com>
References: <20171214021019.13579-1-ross.zwisler@linux.intel.com>
 <20171214130032.GK16951@dhcp22.suse.cz>
 <20171218203547.GA2366@linux.intel.com>
 <20171220181937.GB12236@bombadil.infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20171220181937.GB12236@bombadil.infradead.org>
Sender: owner-linux-mm@kvack.org
List-ID: <linux-mm.kvack.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>, Michal Hocko <mhocko@kernel.org>, linux-kernel@vger.kernel.org, "Anaczkowski, Lukasz" <lukasz.anaczkowski@intel.com>, "Box, David E" <david.e.box@intel.com>, "Kogut, Jaroslaw" <Jaroslaw.Kogut@intel.com>, "Koss, Marcin" <marcin.koss@intel.com>, "Koziej, Artur" <artur.koziej@intel.com>, "Lahtinen, Joonas" <joonas.lahtinen@intel.com>, "Moore, Robert" <robert.moore@intel.com>, "Nachimuthu, Murugasamy" <murugasamy.nachimuthu@intel.com>, "Odzioba, Lukasz" <lukasz.odzioba@intel.com>, "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>, "Rafael J. Wysocki" <rjw@rjwysocki.net>, "Schmauss, Erik" <erik.schmauss@intel.com>, "Verma, Vishal L" <vishal.l.verma@intel.com>, "Zheng, Lv" <lv.zheng@intel.com>, Andrew Morton <akpm@linux-foundation.org>, Balbir Singh <bsingharora@gmail.com>, Brice Goglin <brice.goglin@gmail.com>, Dan Williams <dan.j.williams@intel.com>, Dave Hansen <dave.hansen@intel.com>, Jerome Glisse <jglisse@redhat.com>, John Hubbard <jhubbard@nvidia.com>, Len Brown <lenb@kernel.org>, Tim Chen <tim.c.chen@linux.intel.com>, devel@acpica.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, linux-api@vger.kernel.org, linuxppc-dev@lists.ozlabs.org

On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> > What I'm hoping to do with this series is to just provide a sysfs
> > representation of the HMAT so that applications can know which NUMA nodes to
> > select with existing utilities like numactl.  This series does not currently
> > alter any kernel behavior, it only provides a sysfs interface.
> > 
> > Say for example you had a system with some high bandwidth memory (HBM), and
> > you wanted to use it for a specific application.  You could use the sysfs
> > representation of the HMAT to figure out which memory target held your HBM.
> > You could do this by looking at the local bandwidth values for the various
> > memory targets, so:
> > 
> > 	# grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
> > 	/sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> > 
> > and look for the one that corresponds to your HBM speed. (These numbers are
> > made up, but you get the idea.)
> 
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future.  I think
> we need a platform-agnostic way ... right, PowerPC people?

Hey Matthew,

Yep, this is where I started as well.  My plan with my initial implementation
was to try and make the sysfs representation as platform agnostic as possible,
and just have the ACPI HMAT as one of the many places to gather the data
needed to populate sysfs.

However, as I began coding the implementation became very specific to the
HMAT, probably because I don't know of way that this type of info is
represented on another platform.  John Hubbard noticed the same thing and
asked me to s/HMEM/HMAT/ everywhere and just make it HMAT specific, and to
prevent it from being confused with the HMM work:

https://lkml.org/lkml/2017/7/7/33
https://lkml.org/lkml/2017/7/7/442

I'm open to making it more platform agnostic if I can get my hands on a
parallel effort in another platform and tease out the commonality, but trying
to do that without a second example hasn't worked out.  If we don't have a
good second example right now I think maybe we should put this in and then
merge it with the second example when it comes along.

> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).
> 
> > Once you know the NUMA node of your HBM, you can figure out the NUMA node of
> > it's local initiator:
> > 
> > 	# ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> > 
> > So, in our made-up example our HBM is located in numa node 2, and the local
> > CPU for that HBM is at numa node 0.
> 
> initiator is a CPU?  I'd have expected you to expose a memory controller
> abstraction rather than re-use storage terminology.

Yea, I agree that at first blush it seems weird.  It turns out that looking at
it in sort of a storage initiator/target way is beneficial, though, because it
allows us to cut down on the number of data values we need to represent.

For example the SLIT, which doesn't differentiate between initiator and target
proximity domains (and thus nodes) always represents a system with N proximity
domains using a NxN distance table.  This makes sense if every node contains
both CPUs and memory.

With the introduction of the HMAT, though, we can have memory-only initiator
nodes and we can explicitly associate them with their local CPU.  This is
necessary so that we can separate memory with different performance
characteristics (HBM vs normal memory vs persistent memory, for example) that
are all attached to the same CPU.

So, say we now have a system with 4 CPUs, and each of those CPUs has 3
different types of memory attached to it.  We now have 16 total proximity
domains, 4 CPU and 12 memory.

If we represent this with the SLIT we end up with a 16 X 16 distance table
(256 entries), most of which don't matter because they are memory-to-memory
distances which don't make sense.

In the HMAT, though, we separate out the initiators and the targets and put
them into separate lists.  (See 5.2.27.4 System Locality Latency and Bandwidth
Information Structure in ACPI 6.2 for details.)  So, this same config in the
HMAT only has 4*12=48 performance values of each type, all of which convey
meaningful information.

The HMAT indeed even uses the storage "initiator" and "target" terminology. :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============9134330909207175583=="
MIME-Version: 1.0
From: Ross Zwisler <ross.zwisler at linux.intel.com>
Subject: Re: [Devel] [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Date: Wed, 20 Dec 2017 14:13:50 -0700
Message-ID: <20171220211350.GA2688@linux.intel.com>
In-Reply-To: 20171220181937.GB12236@bombadil.infradead.org
List-ID: <devel@acpica.org>
To: devel@acpica.org

--===============9134330909207175583==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> > What I'm hoping to do with this series is to just provide a sysfs
> > representation of the HMAT so that applications can know which NUMA nod=
es to
> > select with existing utilities like numactl.  This series does not curr=
ently
> > alter any kernel behavior, it only provides a sysfs interface.
> > =

> > Say for example you had a system with some high bandwidth memory (HBM),=
 and
> > you wanted to use it for a specific application.  You could use the sys=
fs
> > representation of the HMAT to figure out which memory target held your =
HBM.
> > You could do this by looking at the local bandwidth values for the vari=
ous
> > memory targets, so:
> > =

> > 	# grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
> > 	/sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> > =

> > and look for the one that corresponds to your HBM speed. (These numbers=
 are
> > made up, but you get the idea.)
> =

> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future.  I think
> we need a platform-agnostic way ... right, PowerPC people?

Hey Matthew,

Yep, this is where I started as well.  My plan with my initial implementati=
on
was to try and make the sysfs representation as platform agnostic as possib=
le,
and just have the ACPI HMAT as one of the many places to gather the data
needed to populate sysfs.

However, as I began coding the implementation became very specific to the
HMAT, probably because I don't know of way that this type of info is
represented on another platform.  John Hubbard noticed the same thing and
asked me to s/HMEM/HMAT/ everywhere and just make it HMAT specific, and to
prevent it from being confused with the HMM work:

https://lkml.org/lkml/2017/7/7/33
https://lkml.org/lkml/2017/7/7/442

I'm open to making it more platform agnostic if I can get my hands on a
parallel effort in another platform and tease out the commonality, but tryi=
ng
to do that without a second example hasn't worked out.  If we don't have a
good second example right now I think maybe we should put this in and then
merge it with the second example when it comes along.

> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).
> =

> > Once you know the NUMA node of your HBM, you can figure out the NUMA no=
de of
> > it's local initiator:
> > =

> > 	# ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> > =

> > So, in our made-up example our HBM is located in numa node 2, and the l=
ocal
> > CPU for that HBM is at numa node 0.
> =

> initiator is a CPU?  I'd have expected you to expose a memory controller
> abstraction rather than re-use storage terminology.

Yea, I agree that at first blush it seems weird.  It turns out that looking=
 at
it in sort of a storage initiator/target way is beneficial, though, because=
 it
allows us to cut down on the number of data values we need to represent.

For example the SLIT, which doesn't differentiate between initiator and tar=
get
proximity domains (and thus nodes) always represents a system with N proxim=
ity
domains using a NxN distance table.  This makes sense if every node contains
both CPUs and memory.

With the introduction of the HMAT, though, we can have memory-only initiator
nodes and we can explicitly associate them with their local CPU.  This is
necessary so that we can separate memory with different performance
characteristics (HBM vs normal memory vs persistent memory, for example) th=
at
are all attached to the same CPU.

So, say we now have a system with 4 CPUs, and each of those CPUs has 3
different types of memory attached to it.  We now have 16 total proximity
domains, 4 CPU and 12 memory.

If we represent this with the SLIT we end up with a 16 X 16 distance table
(256 entries), most of which don't matter because they are memory-to-memory
distances which don't make sense.

In the HMAT, though, we separate out the initiators and the targets and put
them into separate lists.  (See 5.2.27.4 System Locality Latency and Bandwi=
dth
Information Structure in ACPI 6.2 for details.)  So, this same config in the
HMAT only has 4*12=3D48 performance values of each type, all of which convey
meaningful information.

The HMAT indeed even uses the storage "initiator" and "target" terminology.=
 :)

--===============9134330909207175583==--