* [LSF/MM TOPIC] memory reclaim with NUMA rebalancing
From: Michal Hocko @ 2019-01-30 17:48 UTC
To: lsf-pc; +Cc: linux-mm, LKML, linux-nvme

Hi,
I would like to propose the following topic for the MM track. Different
groups of people would like to use NVDIMMs as low-cost but slower memory
which is presented to the system as a NUMA node. We do have a NUMA API,
but it doesn't really fit the "balance the memory between nodes" need.
People would like to have hot pages in regular RAM while cold pages
might live on slower NUMA nodes. We do have NUMA balancing for the
promotion path, but there is nothing for the other direction. Can we
start considering memory reclaim that moves pages to more distant and
idle NUMA nodes rather than reclaiming them? There are certainly
details that will get quite complicated, but I guess it is time to
start discussing this at least.
--
Michal Hocko
SUSE Labs
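For context, the closest thing the existing NUMA API offers is explicit,
userspace-driven placement, e.g. via move_pages(2). A minimal sketch
follows, assuming a two-node system where node 1 is the slower (e.g.
PMEM) node; the node number and the choice of which page is "cold" are
illustrative assumptions. The proposal is essentially about having
reclaim make this kind of decision automatically instead of leaving it
to applications:

    /*
     * Minimal sketch: push one page onto an assumed slower NUMA node using
     * the existing move_pages(2) interface.  The target node number (1) and
     * the notion that this particular page is "cold" are illustrative
     * assumptions, not something the kernel decides here.
     * Build with: gcc demote.c -o demote -lnuma
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <numaif.h>

    int main(void)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        void *cold;                        /* stand-in for a cold page */
        void *pages[1];
        int nodes[1] = { 1 };              /* assumed slower (e.g. PMEM) node */
        int status[1];

        if (posix_memalign(&cold, page_size, page_size))
            return 1;
        memset(cold, 0, page_size);        /* fault it in on the local node */

        pages[0] = cold;
        if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
            perror("move_pages");
        else
            printf("page is now on node %d\n", status[0]); /* negative = errno */
        return 0;
    }

The same interface underlies the libnuma helper numa_move_pages() and is
what an out-of-band userspace "demotion" tool could use today; the point
of the topic is whether reclaim itself should do this.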
* Re: [LSF/MM TOPIC] memory reclaim with NUMA rebalancing
From: Keith Busch @ 2019-01-30 18:12 UTC
To: Michal Hocko; +Cc: lsf-pc, linux-mm, LKML, linux-nvme

On Wed, Jan 30, 2019 at 06:48:47PM +0100, Michal Hocko wrote:
> Hi,
> I would like to propose the following topic for the MM track. Different
> groups of people would like to use NVDIMMs as low-cost but slower memory
> which is presented to the system as a NUMA node. We do have a NUMA API,
> but it doesn't really fit the "balance the memory between nodes" need.
> People would like to have hot pages in regular RAM while cold pages
> might live on slower NUMA nodes. We do have NUMA balancing for the
> promotion path, but there is nothing for the other direction. Can we
> start considering memory reclaim that moves pages to more distant and
> idle NUMA nodes rather than reclaiming them? There are certainly
> details that will get quite complicated, but I guess it is time to
> start discussing this at least.

Yes, thanks for the proposal. I would be very interested in this
discussion for MM. I think some of the details for determining such a
migration path are related to the heterogeneous memory attributes I'm
currently trying to export.
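To give a sense of how those attributes might be consumed once exported,
here is a small sketch that reads a per-node latency value from sysfs.
The path layout (/sys/devices/system/node/nodeN/access0/initiators/...)
follows the proposed interface but is an assumption at this point rather
than a settled ABI, and attribute names or units may differ in whatever
finally lands:

    /*
     * Illustration only: read a per-node access latency attribute from sysfs.
     * The path layout below (nodeN/access0/initiators/...) follows the
     * proposed interface and is an assumption here, not a settled ABI.
     */
    #include <stdio.h>

    int main(void)
    {
        char path[128], buf[64];
        int node;

        for (node = 0; node < 4; node++) {     /* probe the first few nodes */
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/node%d/access0/initiators/read_latency",
                     node);
            f = fopen(path, "r");
            if (!f)
                continue;                      /* node absent or attribute not exported */
            if (fgets(buf, sizeof(buf), f))
                printf("node%d read_latency: %s", node, buf);
            fclose(f);
        }
        return 0;
    }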
* Re: [LSF/MM TOPIC] memory reclaim with NUMA rebalancing
From: Yang Shi @ 2019-01-30 23:53 UTC
To: Michal Hocko; +Cc: lsf-pc, Linux MM, LKML, linux-nvme

On Wed, Jan 30, 2019 at 9:48 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> Hi,
> I would like to propose the following topic for the MM track. Different
> groups of people would like to use NVDIMMs as low-cost but slower memory
> which is presented to the system as a NUMA node. We do have a NUMA API,
> but it doesn't really fit the "balance the memory between nodes" need.
> People would like to have hot pages in regular RAM while cold pages
> might live on slower NUMA nodes. We do have NUMA balancing for the
> promotion path, but there is nothing for the other direction. Can we
> start considering memory reclaim that moves pages to more distant and
> idle NUMA nodes rather than reclaiming them? There are certainly
> details that will get quite complicated, but I guess it is time to
> start discussing this at least.

I would be interested in this topic too. We (Alibaba) do have some use
cases using NVDIMM as a NUMA node. Node balancing (or cold/hot data
migration) is one of the things we need in order to achieve optimal
performance for some workloads. I also proposed a related topic.

Regards,
Yang

> --
> Michal Hocko
> SUSE Labs
* [LSF/MM ATTEND] memory reclaim with NUMA rebalancing
From: Aneesh Kumar K.V @ 2019-01-31 6:49 UTC
To: Michal Hocko, lsf-pc; +Cc: linux-mm, LKML, linux-nvme

Michal Hocko <mhocko@kernel.org> writes:

> Hi,
> I would like to propose the following topic for the MM track. Different
> groups of people would like to use NVDIMMs as low-cost but slower memory
> which is presented to the system as a NUMA node. We do have a NUMA API,
> but it doesn't really fit the "balance the memory between nodes" need.
> People would like to have hot pages in regular RAM while cold pages
> might live on slower NUMA nodes. We do have NUMA balancing for the
> promotion path, but there is nothing for the other direction. Can we
> start considering memory reclaim that moves pages to more distant and
> idle NUMA nodes rather than reclaiming them? There are certainly
> details that will get quite complicated, but I guess it is time to
> start discussing this at least.

I would be interested in this topic too. I would like to understand
the API and how it can help exploit the different types of devices we
have on OpenCAPI.

IMHO there are a few proposals related to this which we could discuss
together:

1. The HMAT series which wants to expose these devices as NUMA nodes.
2. The patch series from Dave Hansen which just uses PMEM as a NUMA node.
3. The patch series from Fengguang Wu which prevents default allocation
   from these NUMA nodes by excluding them from the zonelists.
4. The patch series from Jerome Glisse which doesn't expose these as
   NUMA nodes at all.

IMHO (3) is suggesting that we really don't want them as NUMA nodes. But
since NUMA is the only interface we currently have to present them as
memory and to control allocation and migration, we are forcing ourselves
into NUMA nodes and then excluding them from default allocation.

-aneesh
* Re: [LSF/MM ATTEND] memory reclaim with NUMA rebalancing
From: Christopher Lameter @ 2019-02-06 19:03 UTC
To: Aneesh Kumar K.V; +Cc: Michal Hocko, lsf-pc, linux-mm, LKML, linux-nvme

On Thu, 31 Jan 2019, Aneesh Kumar K.V wrote:

> I would be interested in this topic too. I would like to understand
> the API and how it can help exploit the different types of devices we
> have on OpenCAPI.

So am I. We may want to rethink the whole NUMA API and the way we handle
different types of memory with their divergent performance
characteristics.

We need some way to allow a better selection of memory from the kernel
without creating too much complexity. We have new characteristics to
cover:

1. Persistence (NVRAM), or generally a storage device that allows access
to the medium via a RAM-like interface.

2. Coprocessor memory that can be shuffled back and forth to a device
(HMM).

3. On-device memory (important since PCIe limitations are currently a
problem, Intel is stuck on PCIe3, and devices start to bypass the
processor to gain performance).

4. High-density RAM (GDDR, f.e.) with different caching behavior and/or
different cacheline sizes.

5. Modifying access characteristics by reserving a slice of a cache
(f.e. L3) for a specific memory region.

6. SRAM support (high-speed memory on the processor itself, or using the
processor cache to persist a cacheline).

And then the old NUMA stuff where only the latency to memory varies. But
that was a particular solution targeted at scaling SMP systems through
interconnects. This was a mostly symmetric approach. The use of
accelerators etc. and the above characteristics lead to more complex
asymmetric memory approaches that may be difficult to manage and use
from kernel space.
* Re: [LSF/MM ATTEND] memory reclaim with NUMA rebalancing
From: Jonathan Cameron @ 2019-02-22 13:48 UTC
To: Christopher Lameter; +Cc: Aneesh Kumar K.V, Michal Hocko, lsf-pc, linux-mm, LKML, linux-nvme

On Wed, 6 Feb 2019 19:03:48 +0000 Christopher Lameter <cl@linux.com> wrote:

> On Thu, 31 Jan 2019, Aneesh Kumar K.V wrote:
>
> > I would be interested in this topic too. I would like to understand
> > the API and how it can help exploit the different types of devices we
> > have on OpenCAPI.

I'll second this from CCIX as well ;) We get even crazier topologies
than OpenCAPI, but thankfully it'll probably be a little while before
full plug-and-play topology building occurs, so we have time to get
this right.

> So am I. We may want to rethink the whole NUMA API and the way we handle
> different types of memory with their divergent performance
> characteristics.
>
> We need some way to allow a better selection of memory from the kernel
> without creating too much complexity. We have new characteristics to
> cover:
>
> 1. Persistence (NVRAM), or generally a storage device that allows access
> to the medium via a RAM-like interface.

We definitely have this one, with all the use cases that turn up
anywhere, including, importantly, the cheap extremely large RAM option.

> 2. Coprocessor memory that can be shuffled back and forth to a device
> (HMM).

I'm not sure how this applies to fully coherent device memory. In those
cases you 'might' want to shuffle the memory to the device, but whether
that makes more sense than simply relying on your coherent caches at
the device is incredibly use-case dependent.

One key thing here is access to the information on who is using the
memory. NUMA balancing is fine, but often much finer-grained, or
longer-term statistical, information is needed. So basically similar to
the hot page tracking work, but with tracking of 'who' accessed it
(needs hardware support to avoid the cost of current NUMA balancing?)
Performance measurement units can help with this where present, but we
need a means to feed that information into whatever is handling
placement/migration decisions. (I do like the user-space aspect of the
Intel hot page migration patch as it lets us play a lot more in this
area - particularly prior to any standards being defined.)

For us (allowing for hardware tracking of ATCs), the recent migration
of hot/cold pages in and out of NVDIMMs only covers the simplest of
cases (expansion memory) where the topology is really straightforward.
It's a good step, but perhaps only a first one...

> 3. On-device memory (important since PCIe limitations are currently a
> problem, Intel is stuck on PCIe3, and devices start to bypass the
> processor to gain performance).

Whilst it's not so bad on CCIX or our platforms in general, PCIe 5.0+
is still some way off and I'm sure there are already applications that
are bandwidth limited at 64 Gbit/s.

However, having said that, we are interested in peer-to-peer migration
of memory between devices (probably all still coherent, but in theory
it doesn't have to be). Once we get complex accelerator interactions on
large fabrics, knowing what to do here gets really tricky. You can do
some of this with aware user-space code and current NUMA interfaces.

There are also fun side decisions such as where to put your page tables
in such a system, as the walker and the translation user may not be
anywhere near each other or anywhere near the memory being used.

> 4. High-density RAM (GDDR, f.e.) with different caching behavior and/or
> different cacheline sizes.

That is an interesting one, particularly when we have caches out in the
interconnect. It gets really interesting if those caches are shared by
multiple memories and you may or may not have partitioning + really
complex cache implementations and hardware trickery. Basically it's
more memory heterogeneity, just with respect to the caches in the path.

> 5. Modifying access characteristics by reserving a slice of a cache
> (f.e. L3) for a specific memory region.

A possible complexity, as is reservations for particular process groups.

> 6. SRAM support (high-speed memory on the processor itself, or using the
> processor cache to persist a cacheline).
>
> And then the old NUMA stuff where only the latency to memory varies. But
> that was a particular solution targeted at scaling SMP systems through
> interconnects. This was a mostly symmetric approach. The use of
> accelerators etc. and the above characteristics lead to more complex
> asymmetric memory approaches that may be difficult to manage and use
> from kernel space.

Agreed entirely on this last point. This stuff is getting really
complex, and people have an annoying habit of just expecting it to work
well.

Moving the burden of memory placement to user space (with enough
description of the hardware for it to make a good decision) seems a
good idea to me. This is particularly true whilst some of the hardware
design decisions are still up in the air. Clearly there are aspects
that we want to 'just work' that make sense in the kernel, but how do
we ensure we have enough hooks to allow smart user-space code to make
the decisions without having to work around the in-kernel management?

It's worth noting the hardware people are often open to suggestions for
what info software will actually use. Some of the complexity of that
decision space could definitely be reduced if we get some agreement on
what the kernel needs to know, so we can push for hardware that can
self-describe. There are also cases where specifications wait on the
kernel community coming to some consensus, so as to ensure the hardware
matches the requirements. It is also worth noting that the kernel
community has various paths (including some on this list) to feed back
into the firmware specifications etc. If there are things the kernel
needs to magically know, then we propose changes at all levels:
hardware specs, firmware, (kernel obviously), user space.

It has been raised before in a number of related threads, but it is
worth keeping in mind the questions:

1) How much effort will userspace put into using any controls we give
it? HPC people might well, but their platforms tend to be repeated a
lot, so they will sometimes take the time to hand-tune to a particular
hardware configuration.

2) Does the 'normal' user need this complexity soon? We need to make
sure things work well with defaults, if this heterogeneous hardware
starts turning up in highly varied configurations in workstations /
servers.

While I'm highly interested in this area, I'm not an mm specialist. I
want solutions, but I'm sure most of the ideas I have are crazy ;)
Seeing the hardware coming down the line, crazy may be needed.

Jonathan
* Re: [LSF/MM ATTEND] memory reclaim with NUMA rebalancing
From: Larry Woodman @ 2019-02-22 14:12 UTC
To: Christopher Lameter, Aneesh Kumar K.V; +Cc: Michal Hocko, lsf-pc, linux-mm, LKML, linux-nvme

On 02/06/2019 02:03 PM, Christopher Lameter wrote:
> On Thu, 31 Jan 2019, Aneesh Kumar K.V wrote:
>
>> I would be interested in this topic too. I would like to understand
>> the API and how it can help exploit the different types of devices we
>> have on OpenCAPI.

Same here. We (Red Hat) have quite a bit of experience running on
several large systems (32 TB / 128 nodes / 1024 CPUs). Some of these
systems have NVRAM and can operate in memory mode as well as storage
mode.

Larry
* Re: [LSF/MM ATTEND] memory reclaim with NUMA rebalancing
From: Fengguang Wu @ 2019-02-23 13:27 UTC
To: Aneesh Kumar K.V; +Cc: Michal Hocko, lsf-pc, linux-mm, LKML, linux-nvme

On Thu, Jan 31, 2019 at 12:19:47PM +0530, Aneesh Kumar K.V wrote:
> Michal Hocko <mhocko@kernel.org> writes:
>
>> [...]
>
> I would be interested in this topic too. I would like to understand

So would I. I'd be glad to join the discussion if I can attend the slot.

> the API and how it can help exploit the different types of devices we
> have on OpenCAPI.
>
> IMHO there are a few proposals related to this which we could discuss
> together:
>
> 1. The HMAT series which wants to expose these devices as NUMA nodes.
> 2. The patch series from Dave Hansen which just uses PMEM as a NUMA node.
> 3. The patch series from Fengguang Wu which prevents default allocation
>    from these NUMA nodes by excluding them from the zonelists.
> 4. The patch series from Jerome Glisse which doesn't expose these as
>    NUMA nodes at all.
>
> IMHO (3) is suggesting that we really don't want them as NUMA nodes. But
> since NUMA is the only interface we currently have to present them as
> memory and to control allocation and migration, we are forcing ourselves
> into NUMA nodes and then excluding them from default allocation.

Regarding (3), we actually made a default policy choice of "separate
fallback zonelists for PMEM/DRAM nodes" for the typical use scenarios.

In the long term, it's better not to build such an assumption into the
kernel. There may well be workloads that are cost sensitive rather than
performance sensitive. Suppose people buy a machine with tiny DRAM and
large PMEM. In that case the suitable policy may be to

1) prefer (but not bind) slab etc. kernel pages in DRAM
2) allocate LRU etc. pages from either the DRAM or the PMEM node

In summary, the kernel may offer flexibility for different policies for
use by different users. PMEM has different characteristics compared to
DRAM; its pages may or may not be treated differently from DRAM,
depending on policy.

Thanks,
Fengguang
* Re: [LSF/MM ATTEND] memory reclaim with NUMA rebalancing
From: Fengguang Wu @ 2019-02-23 13:42 UTC
To: Aneesh Kumar K.V; +Cc: Michal Hocko, lsf-pc, linux-mm, LKML, linux-nvme

On Sat, Feb 23, 2019 at 09:27:48PM +0800, Fengguang Wu wrote:
> On Thu, Jan 31, 2019 at 12:19:47PM +0530, Aneesh Kumar K.V wrote:
>> [...]
>> IMHO (3) is suggesting that we really don't want them as NUMA nodes. But
>> since NUMA is the only interface we currently have to present them as
>> memory and to control allocation and migration, we are forcing ourselves
>> into NUMA nodes and then excluding them from default allocation.
>
> Regarding (3), we actually made a default policy choice of "separate
> fallback zonelists for PMEM/DRAM nodes" for the typical use scenarios.
>
> In the long term, it's better not to build such an assumption into the
> kernel. There may well be workloads that are cost sensitive rather than
> performance sensitive. Suppose people buy a machine with tiny DRAM and
> large PMEM. In that case the suitable policy may be to
>
> 1) prefer (but not bind) slab etc. kernel pages in DRAM
> 2) allocate LRU etc. pages from either the DRAM or the PMEM node

The point being that the fallback zonelists would not be separated for
DRAM and PMEM in this case.

> In summary, the kernel may offer flexibility for different policies for
> use by different users. PMEM has different characteristics compared to
> DRAM; its pages may or may not be treated differently from DRAM,
> depending on policy.
>
> Thanks,
> Fengguang
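The "prefer (but not bind)" behavior in 1) already has a per-task
analogue in the existing mempolicy API: MPOL_PREFERRED allocates from
the preferred node while it has free memory and falls back to other
nodes otherwise. A minimal sketch, assuming node 0 is the small DRAM
node and the larger PMEM capacity sits on other nodes (the node
numbering is an assumption for illustration):

    /*
     * Sketch of "prefer (but not bind)" using the existing per-task
     * mempolicy API: allocations prefer node 0 (assumed here to be the
     * small DRAM node) but silently fall back to other nodes (e.g. a large
     * PMEM node) once DRAM is exhausted.  Node numbering is an assumption.
     * Build with: gcc prefer.c -o prefer -lnuma
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <numaif.h>

    int main(void)
    {
        unsigned long nodemask = 1UL << 0;     /* prefer node 0 */

        if (set_mempolicy(MPOL_PREFERRED, &nodemask, 8 * sizeof(nodemask) + 1))
            perror("set_mempolicy");

        /* later allocations prefer DRAM but are not bound to it */
        void *p = malloc(64UL << 20);
        if (p)
            printf("allocated 64MB under a preferred (not bound) policy\n");
        free(p);
        return 0;
    }

Whether the kernel should apply an equivalent default for its own
allocations (slab etc.), as suggested above, is exactly the open policy
question.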