From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>,
	"Box, David E" <david.e.box@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	"Zheng, Lv" <lv.zheng@intel.com>,
	linux-nvdimm@lists.01.org,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	"Anaczkowski, Lukasz" <lukasz.anaczkowski@intel.com>,
	"Moore, Robert" <robert.moore@intel.com>,
	"Odzioba, Lukasz" <lukasz.odzioba@intel.com>,
	"Schmauss, Erik" <erik.schmauss@intel.com>,
	Len Brown <lenb@kernel.org>, John Hubbard <jhubbard@nvidia.com>,
	linuxppc-dev@lists.ozlabs.org, Jerome Glisse <jglisse@redhat.com>,
	devel@acpica.org, "Kogut, Jaroslaw" <Jaroslaw.Kogut@intel.com>,
	linux-mm@kvack.org, "Koss, Marcin" <marcin.koss@intel.com>,
	linux-api@vger.kernel.org, Brice Goglin <brice.goglin@gmail.com>,
	"Nachimuthu, Murugasamy" <murugasamy.nachimuthu@intel.com>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	linux-kernel@vger.kernel.org,
	"Koziej, Artur" <artur.koziej@intel.com>,
	"Lahtinen, Joonas" <joonas.lahtinen@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tim Chen <tim.c.chen@linux.intel.com>
Subject: Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Date: Wed, 20 Dec 2017 14:13:50 -0700	[thread overview]
Message-ID: <20171220211350.GA2688@linux.intel.com> (raw)
In-Reply-To: <20171220181937.GB12236@bombadil.infradead.org>

On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> > What I'm hoping to do with this series is to just provide a sysfs
> > representation of the HMAT so that applications can know which NUMA nodes to
> > select with existing utilities like numactl.  This series does not currently
> > alter any kernel behavior; it only provides a sysfs interface.
> > 
> > Say for example you had a system with some high bandwidth memory (HBM), and
> > you wanted to use it for a specific application.  You could use the sysfs
> > representation of the HMAT to figure out which memory target held your HBM.
> > You could do this by looking at the local bandwidth values for the various
> > memory targets, so:
> > 
> > 	# grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
> > 	/sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
> > 	/sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> > 
> > and look for the one that corresponds to your HBM speed. (These numbers are
> > made up, but you get the idea.)
> 
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future.  I think
> we need a platform-agnostic way ... right, PowerPC people?

Hey Matthew,

Yep, this is where I started as well.  My plan with my initial implementation
was to try and make the sysfs representation as platform agnostic as possible,
and just have the ACPI HMAT as one of the many places to gather the data
needed to populate sysfs.

However, as I began coding, the implementation became very specific to the
HMAT, probably because I don't know of a way that this type of info is
represented on another platform.  John Hubbard noticed the same thing and
asked me to s/HMEM/HMAT/ everywhere and just make it HMAT specific, in part
to prevent it from being confused with the HMM work:

https://lkml.org/lkml/2017/7/7/33
https://lkml.org/lkml/2017/7/7/442

I'm open to making it more platform agnostic if I can get my hands on a
parallel effort in another platform and tease out the commonality, but trying
to do that without a second example hasn't worked out.  If we don't have a
good second example right now, I think maybe we should put this in and then
merge it with the second example when it comes along.
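
In the meantime, the workflow quoted above already composes with stock
numactl.  As a sketch (untested, using the made-up node numbers from the
examples in this mail, and a placeholder binary name), you could pick the
target with the highest local write bandwidth and then bind your app to that
node and its local initiator:

	# grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps | \
		sort -t: -k2 -rn | head -1
	/sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
	# numactl --membind=2 --cpunodebind=0 ./my_hbm_app  # placeholder app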

> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).
> 
> > Once you know the NUMA node of your HBM, you can figure out the NUMA node of
> > its local initiator:
> > 
> > 	# ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
> > 	/sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> > 
> > So, in our made-up example our HBM is located in numa node 2, and the local
> > CPU for that HBM is at numa node 0.
> 
> initiator is a CPU?  I'd have expected you to expose a memory controller
> abstraction rather than re-use storage terminology.

Yea, I agree that at first blush it seems weird.  It turns out that looking at
it in sort of a storage initiator/target way is beneficial, though, because it
allows us to cut down on the number of data values we need to represent.

For example, the SLIT, which doesn't differentiate between initiator and
target proximity domains (and thus nodes), always represents a system with N
proximity domains using an NxN distance table.  This makes sense if every
node contains
both CPUs and memory.

With the introduction of the HMAT, though, we can have memory-only target
nodes and we can explicitly associate them with their local CPU.  This is
necessary so that we can separate memory with different performance
characteristics (HBM vs normal memory vs persistent memory, for example) that
are all attached to the same CPU.
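
With the layout in this series you can see that association directly.  For
instance (a sketch, assuming mem_tgtN/mem_initN map to NUMA nodes as in the
examples above), globbing for a given initiator lists every memory target
that is local to it; on the made-up box below you'd get three matches, one
per memory type hanging off CPU node 0:

	# ls -d /sys/devices/system/hmat/mem_tgt*/local_init/mem_init0
	/sys/devices/system/hmat/mem_tgt2/local_init/mem_init0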

So, say we now have a system with 4 CPUs, and each of those CPUs has 3
different types of memory attached to it.  We now have 16 total proximity
domains, 4 CPU and 12 memory.

If we represent this with the SLIT we end up with a 16x16 distance table
(256 entries), most of which don't matter because they are memory-to-memory
distances, which don't make sense.

In the HMAT, though, we separate out the initiators and the targets and put
them into separate lists.  (See 5.2.27.4 System Locality Latency and Bandwidth
Information Structure in ACPI 6.2 for details.)  So, this same config in the
HMAT only has 4*12=48 performance values of each type, all of which convey
meaningful information.
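
More generally, with i initiator domains and t target domains, a SLIT-style
table needs (i+t)^2 entries while the HMAT needs i*t per metric.  A quick
shell check for the example above:

	# echo "$(( (4 + 12) ** 2 )) SLIT entries vs $(( 4 * 12 )) HMAT entries"
	256 SLIT entries vs 48 HMAT entries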

The HMAT indeed even uses the storage "initiator" and "target" terminology. :)

Thread overview: 180+ messages
2017-12-14  2:10 [PATCH v3 0/3] create sysfs representation of ACPI HMAT Ross Zwisler
2017-12-14  2:10 ` [PATCH v3 1/3] acpi: HMAT support in acpi_parse_entries_array() Ross Zwisler
2017-12-15  0:49   ` Rafael J. Wysocki
2017-12-15  1:10   ` Dan Williams
2017-12-16  1:53     ` Rafael J. Wysocki
2017-12-16  1:57       ` Dan Williams
2017-12-16  2:15         ` Rafael J. Wysocki
2017-12-14  2:10 ` [PATCH v3 2/3] hmat: add heterogeneous memory sysfs support Ross Zwisler
2017-12-15  0:52   ` Rafael J. Wysocki
2017-12-15 20:53     ` Ross Zwisler
2017-12-14  2:10 ` [PATCH v3 3/3] hmat: add performance attributes Ross Zwisler
2017-12-14 13:00 ` [PATCH v3 0/3] create sysfs representation of ACPI HMAT Michal Hocko
2017-12-18 20:35   ` Ross Zwisler
2017-12-20 16:41     ` Ross Zwisler
2017-12-21 13:18       ` Michal Hocko
2017-12-20 18:19     ` Matthew Wilcox
2017-12-20 20:22       ` Dave Hansen
2017-12-20 21:16         ` Matthew Wilcox
2017-12-20 21:24           ` Ross Zwisler
2017-12-20 22:29             ` Dan Williams
2017-12-20 22:41               ` Ross Zwisler
2017-12-21 20:31                 ` Brice Goglin
2017-12-22 22:53                   ` Dan Williams
2017-12-22 23:22                     ` Ross Zwisler
2017-12-22 23:57                       ` Dan Williams
2017-12-23  1:14                         ` Rafael J. Wysocki
2017-12-27  9:10                     ` Brice Goglin
2017-12-30  6:58                       ` Matthew Wilcox
2017-12-30  9:19                         ` Brice Goglin
2017-12-20 21:13       ` Ross Zwisler [this message]
2017-12-21  1:41         ` Elliott, Robert (Persistent Memory)
2017-12-22 21:46           ` Ross Zwisler
2017-12-21 12:50       ` Michael Ellerman
2017-12-22  3:09 ` Anshuman Khandual
2017-12-22 10:31   ` Kogut, Jaroslaw
2017-12-22 14:37     ` Anshuman Khandual
2017-12-22 17:13   ` Dave Hansen
2017-12-23  5:14     ` Anshuman Khandual
2017-12-22 22:13   ` Ross Zwisler
2017-12-23  6:56     ` Anshuman Khandual
2017-12-22 22:31   ` Ross Zwisler
2017-12-25  2:05     ` Liubo(OS Lab)
