* [LSF/MM TOPIC] Tiered memory accounting and management
@ 2021-06-14 21:51 Tim Chen
  2021-06-16  0:17 ` Yang Shi
  0 siblings, 1 reply; 9+ messages in thread
From: Tim Chen @ 2021-06-14 21:51 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, Michal Hocko, Dan Williams, Dave Hansen


From: Tim Chen <tim.c.chen@linux.intel.com>

Tiered memory accounting and management
------------------------------------------------------------
Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
than others, but a byte of media has about the same cost whether it
is close or far.  But, with new memory tiers such as High-Bandwidth
Memory or Persistent Memory, there is a choice between fast/expensive
and slow/cheap.  But, the current memory cgroups still live in the
old model. There is only one set of limits, and it implies that all
memory has the same cost.  We would like to extend memory cgroups to
comprehend different memory tiers to give users a way to choose a mix
between fast/expensive and slow/cheap.

To manage such memory, we will need to account memory usage and
impose limits for each kind of memory.

There are a couple of approaches, listed below, that have been discussed
previously for partitioning memory between cgroups.  We would like to
use the LSF/MM session to come to a consensus on the approach to
take.

1.	Per NUMA node limit and accounting for each cgroup.
We can assign higher limits on better-performing memory nodes to higher-priority cgroups.

There are some loose ends here that warrant further discussion:
(1) A user-friendly interface for such limits.  Would a proportional
weight for the cgroup that translates to an actual absolute limit be more
suitable (see the sketch below)?
(2) Memory mis-configurations can occur more easily, as the admin
has a much larger number of limits spread across the
cgroups to manage.  Over-restrictive limits can lead to under-utilized,
wasted memory and hurt performance.
(3) OOM behavior when a cgroup hits its limit.
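
As a strawman for the interface question in (1), the translation from a
proportional weight to absolute per-node limits could look like the small
Python sketch below.  This is purely illustrative: the node capacities and
weights are made-up numbers, and the memory.node_limit file it prints is a
hypothetical interface, not something the kernel provides today.

GIB = 1 << 30

# Capacity of each NUMA node (node id -> bytes).  Example values only.
node_capacity = {0: 128 * GIB,    # fast DRAM node
                 1: 512 * GIB}    # slower PMEM node

# Administrator-assigned proportional weights per cgroup.
weights = {"high_prio": 3, "low_prio": 1}

def absolute_limits(weights, capacity):
    """Split each node's capacity between cgroups in proportion to their weight."""
    total = sum(weights.values())
    return {cg: {node: cap * w // total for node, cap in capacity.items()}
            for cg, w in weights.items()}

if __name__ == "__main__":
    for cg, limits in absolute_limits(weights, node_capacity).items():
        for node, limit in limits.items():
            # Hypothetical per-node limit file under /sys/fs/cgroup/<cgroup>/.
            print(f"echo 'N{node} {limit}' > /sys/fs/cgroup/{cg}/memory.node_limit")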

2.	Per memory tier limit and accounting for each cgroup.
We can assign higher limits on memory in the better-performing
memory tiers to higher-priority cgroups.  I previously
prototyped a soft-limit-based implementation to demonstrate the
tiered limit idea.

There are also a number of issues here:
(1)	The advantage is that we have fewer limits to deal with, which simplifies
configuration. However, a number of people have raised doubts about
whether we can really classify the NUMA
nodes into memory tiers properly. There could still be significant performance
differences between NUMA nodes even for the same kind of memory.
We will also not have the fine-grained control and flexibility that come
with a per NUMA node limit.
(2)	Would a memory hierarchy defined by the promotion/demotion relationships
between memory nodes be a viable approach for defining memory tiers (see the
sketch at the end of this message)?

These issues related to the management of systems with multiple kinds of memory
can be ironed out in this session.
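
To make question (2) under approach 2 concrete, here is a small Python sketch
of how tiers could be derived from a demotion relationship between nodes.  The
node_demotion map below is an invented example topology, not something read
from the kernel, and as noted above, nodes that land in the same tier this way
can still differ significantly in performance.

# Each node maps to the node it would demote cold pages to (None = terminal).
node_demotion = {0: 2, 1: 3, 2: None, 3: None}   # DRAM nodes 0,1 demote to PMEM 2,3

def build_tiers(demotion):
    """Group nodes into tiers: tier 0 is the top tier, one tier per demotion hop."""
    targets = {t for t in demotion.values() if t is not None}
    tiers = {n: 0 for n in demotion if n not in targets}   # nodes nothing demotes into
    frontier = list(tiers)
    while frontier:
        node = frontier.pop()
        nxt = demotion.get(node)
        if nxt is not None and nxt not in tiers:
            tiers[nxt] = tiers[node] + 1
            frontier.append(nxt)
    grouped = {}
    for node, tier in sorted(tiers.items()):
        grouped.setdefault(tier, []).append(node)
    return grouped

if __name__ == "__main__":
    for tier, nodes in sorted(build_tiers(node_demotion).items()):
        print(f"tier {tier}: nodes {nodes}")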



* Re: [LSF/MM TOPIC] Tiered memory accounting and management
  2021-06-14 21:51 [LSF/MM TOPIC] Tiered memory accounting and management Tim Chen
@ 2021-06-16  0:17 ` Yang Shi
  2021-06-17 18:48   ` Shakeel Butt
  0 siblings, 1 reply; 9+ messages in thread
From: Yang Shi @ 2021-06-16  0:17 UTC (permalink / raw)
  To: Tim Chen
  Cc: lsf-pc, Linux MM, Michal Hocko, Dan Williams, Dave Hansen,
	Shakeel Butt, David Rientjes

On Mon, Jun 14, 2021 at 2:51 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
> From: Tim Chen <tim.c.chen@linux.intel.com>
>
> Tiered memory accounting and management
> ------------------------------------------------------------
> Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
> than others, but a byte of media has about the same cost whether it
> is close or far.  But, with new memory tiers such as High-Bandwidth
> Memory or Persistent Memory, there is a choice between fast/expensive
> and slow/cheap.  But, the current memory cgroups still live in the
> old model. There is only one set of limits, and it implies that all
> memory has the same cost.  We would like to extend memory cgroups to
> comprehend different memory tiers to give users a way to choose a mix
> between fast/expensive and slow/cheap.
>
> To manage such memory, we will need to account memory usage and
> impose limits for each kind of memory.
>
> There are a couple of approaches, listed below, that have been discussed
> previously for partitioning memory between cgroups.  We would like to
> use the LSF/MM session to come to a consensus on the approach to
> take.
>
> 1.      Per NUMA node limit and accounting for each cgroup.
> We can assign higher limits on better-performing memory nodes to higher-priority cgroups.
>
> There are some loose ends here that warrant further discussion:
> (1) A user-friendly interface for such limits.  Would a proportional
> weight for the cgroup that translates to an actual absolute limit be more suitable?
> (2) Memory mis-configurations can occur more easily, as the admin
> has a much larger number of limits spread across the
> cgroups to manage.  Over-restrictive limits can lead to under-utilized,
> wasted memory and hurt performance.
> (3) OOM behavior when a cgroup hits its limit.
>
> 2.      Per memory tier limit and accounting for each cgroup.
> We can assign higher limits on memory in the better-performing
> memory tiers to higher-priority cgroups.  I previously
> prototyped a soft-limit-based implementation to demonstrate the
> tiered limit idea.
>
> There are also a number of issues here:
> (1)     The advantage is that we have fewer limits to deal with, which simplifies
> configuration. However, a number of people have raised doubts about
> whether we can really classify the NUMA
> nodes into memory tiers properly. There could still be significant performance
> differences between NUMA nodes even for the same kind of memory.
> We will also not have the fine-grained control and flexibility that come
> with a per NUMA node limit.
> (2)     Would a memory hierarchy defined by the promotion/demotion relationships
> between memory nodes be a viable approach for defining memory tiers?
>
> These issues related to the management of systems with multiple kinds of memory
> can be ironed out in this session.

Thanks for suggesting this topic. I'm interested in the topic and
would like to attend.

Other than the above points, I'm wondering whether we should discuss
"Migrate Pages in lieu of discard" as well? Dave Hansen is driving the
development and I have been involved in the early development and
review, but it seems there are still some open questions according to
the latest review feedback.

Some other folks may be interested in this topic as well, so I've CC'ed them on
the thread.

>



* Re: [LSF/MM TOPIC] Tiered memory accounting and management
  2021-06-16  0:17 ` Yang Shi
@ 2021-06-17 18:48   ` Shakeel Butt
  2021-06-18 22:11     ` Tim Chen
  2021-06-21 20:42     ` Yang Shi
  0 siblings, 2 replies; 9+ messages in thread
From: Shakeel Butt @ 2021-06-17 18:48 UTC (permalink / raw)
  To: Yang Shi
  Cc: Tim Chen, lsf-pc, Linux MM, Michal Hocko, Dan Williams,
	Dave Hansen, David Rientjes, Wei Xu, Greg Thelen

Thanks Yang for the CC.

On Tue, Jun 15, 2021 at 5:17 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Jun 14, 2021 at 2:51 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> >
> > From: Tim Chen <tim.c.chen@linux.intel.com>
> >
> > Tiered memory accounting and management
> > ------------------------------------------------------------
> > Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
> > than others, but a byte of media has about the same cost whether it
> > is close or far.  But, with new memory tiers such as High-Bandwidth
> > Memory or Persistent Memory, there is a choice between fast/expensive
> > and slow/cheap.  But, the current memory cgroups still live in the
> > old model. There is only one set of limits, and it implies that all
> > memory has the same cost.  We would like to extend memory cgroups to
> > comprehend different memory tiers to give users a way to choose a mix
> > between fast/expensive and slow/cheap.
> >
> > To manage such memory, we will need to account memory usage and
> > impose limits for each kind of memory.
> >
> > There are a couple of approaches, listed below, that have been discussed
> > previously for partitioning memory between cgroups.  We would like to
> > use the LSF/MM session to come to a consensus on the approach to
> > take.
> >
> > 1.      Per NUMA node limit and accounting for each cgroup.
> > We can assign higher limits on better-performing memory nodes to higher-priority cgroups.
> >
> > There are some loose ends here that warrant further discussion:
> > (1) A user-friendly interface for such limits.  Would a proportional
> > weight for the cgroup that translates to an actual absolute limit be more suitable?
> > (2) Memory mis-configurations can occur more easily, as the admin
> > has a much larger number of limits spread across the
> > cgroups to manage.  Over-restrictive limits can lead to under-utilized,
> > wasted memory and hurt performance.
> > (3) OOM behavior when a cgroup hits its limit.
> >

This (numa based limits) is something I was pushing for but after
discussing this internally with userspace controller devs, I have to
back off from this position.

The main feedback I got was that setting one memory limit is already
complicated and having to set/adjust these many limits would be
horrifying.

> > 2.      Per memory tier limit and accounting for each cgroup.
> > We can assign higher limits on memory in the better-performing
> > memory tiers to higher-priority cgroups.  I previously
> > prototyped a soft-limit-based implementation to demonstrate the
> > tiered limit idea.
> >
> > There are also a number of issues here:
> > (1)     The advantage is that we have fewer limits to deal with, which simplifies
> > configuration. However, a number of people have raised doubts about
> > whether we can really classify the NUMA
> > nodes into memory tiers properly. There could still be significant performance
> > differences between NUMA nodes even for the same kind of memory.
> > We will also not have the fine-grained control and flexibility that come
> > with a per NUMA node limit.
> > (2)     Would a memory hierarchy defined by the promotion/demotion relationships
> > between memory nodes be a viable approach for defining memory tiers?
> >
> > These issues related to the management of systems with multiple kinds of memory
> > can be ironed out in this session.
>
> Thanks for suggesting this topic. I'm interested in the topic and
> would like to attend.
>
> Other than the above points, I'm wondering whether we should discuss
> "Migrate Pages in lieu of discard" as well? Dave Hansen is driving the
> development and I have been involved in the early development and
> review, but it seems there are still some open questions according to
> the latest review feedback.
>
> Some other folks may be interested in this topic as well, so I've CC'ed them on
> the thread.
>

At the moment "personally" I am more inclined towards a passive
approach towards the memcg accounting of memory tiers. By that I mean,
let's start by providing a 'usage' interface and get more
production/real-world data to motivate the 'limit' interfaces. (One
minor reason is that defining the 'limit' interface will force us to
make the decision on defining tiers i.e. numa or a set of numa or
others).

IMHO we should focus more on the "aging" of the application memory and
"migration/balance" between the tiers. I don't think the memory
reclaim infrastructure is the right place for these operations
(unevictable pages are ignored, and the ages are not accurate). What we need is
proactive continuous aging and balancing. We need something like, with
additions, Multi-gen LRUs or DAMON or page idle tracking for aging and
a new mechanism for balancing which takes ages into account.

To give a more concrete example: Let's say we have a system with two
memory tiers and multiple low and high priority jobs. For high
priority jobs, set the allocation try list from high to low tier and
for low priority jobs the reverse of that (I am not sure if we can do
that out of the box with today's kernel). In the background we migrate
cold memory down the tiers and hot memory in the reverse direction.

In this background mechanism we can enforce all different limiting
policies like Yang's original high and low tier percentage or
something like X% of accesses of high priority jobs should be from
high tier. Basically I am saying until we find from production data
that this background mechanism is not strong enough to enforce passive
limits, we should delay the decision on limit interfaces.
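
A minimal Python sketch of the kind of background balancing loop described
above is below.  It works on a toy in-memory model: the per-cgroup page lists,
the hotness values and the "demote"/"promote" steps are all placeholders for a
real aging source (multi-gen LRU, DAMON, page idle tracking) and a real
migration interface, so treat it as an illustration of the policy only.

import random

# Toy model: each page is (id, hotness); hotness would really come from an
# aging mechanism.  Example population only.
pages = {
    "high_prio": {"top": [(i, random.random()) for i in range(1000)],
                  "low": [(i, random.random()) for i in range(1000, 1500)]},
    "low_prio":  {"top": [(i, random.random()) for i in range(2000, 2400)],
                  "low": [(i, random.random()) for i in range(2400, 3400)]},
}

def balance(cg, top_share, batch=64):
    """Keep roughly top_share of the cgroup's pages in the top tier:
    demote the coldest pages when over target, promote the hottest when under."""
    top, low = pages[cg]["top"], pages[cg]["low"]
    target = int((len(top) + len(low)) * top_share)
    if len(top) > target:
        top.sort(key=lambda p: p[1])                 # coldest first
        n = min(batch, len(top) - target)
        low += top[:n]                               # "demote"
        del top[:n]
    elif len(top) < target and low:
        low.sort(key=lambda p: p[1], reverse=True)   # hottest first
        n = min(batch, target - len(top), len(low))
        top += low[:n]                               # "promote"
        del low[:n]

if __name__ == "__main__":
    policy = {"high_prio": 0.9, "low_prio": 0.2}     # example top-tier shares
    for _ in range(20):                              # a few balancing rounds
        for cg, share in policy.items():
            balance(cg, share)
    for cg in pages:
        print(cg, len(pages[cg]["top"]), "top-tier pages,",
              len(pages[cg]["low"]), "low-tier pages")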



* Re: [LSF/MM TOPIC] Tiered memory accounting and management
  2021-06-17 18:48   ` Shakeel Butt
@ 2021-06-18 22:11     ` Tim Chen
  2021-06-18 23:59       ` Shakeel Butt
  2021-06-21 20:42     ` Yang Shi
  1 sibling, 1 reply; 9+ messages in thread
From: Tim Chen @ 2021-06-18 22:11 UTC (permalink / raw)
  To: Shakeel Butt, Yang Shi
  Cc: lsf-pc, Linux MM, Michal Hocko, Dan Williams, Dave Hansen,
	David Rientjes, Wei Xu, Greg Thelen



On 6/17/21 11:48 AM, Shakeel Butt wrote:
> Thanks Yang for the CC.
> 
> On Tue, Jun 15, 2021 at 5:17 PM Yang Shi <shy828301@gmail.com> wrote:
>>
>> On Mon, Jun 14, 2021 at 2:51 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>>
>>>
>>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>>
>>> Tiered memory accounting and management
>>> ------------------------------------------------------------
>>> Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
>>> than others, but a byte of media has about the same cost whether it
>>> is close or far.  But, with new memory tiers such as High-Bandwidth
>>> Memory or Persistent Memory, there is a choice between fast/expensive
>>> and slow/cheap.  But, the current memory cgroups still live in the
>>> old model. There is only one set of limits, and it implies that all
>>> memory has the same cost.  We would like to extend memory cgroups to
>>> comprehend different memory tiers to give users a way to choose a mix
>>> between fast/expensive and slow/cheap.
>>>
>>> To manage such memory, we will need to account memory usage and
>>> impose limits for each kind of memory.
>>>
>>> There are a couple of approaches, listed below, that have been discussed
>>> previously for partitioning memory between cgroups.  We would like to
>>> use the LSF/MM session to come to a consensus on the approach to
>>> take.
>>>
>>> 1.      Per NUMA node limit and accounting for each cgroup.
>>> We can assign higher limits on better-performing memory nodes to higher-priority cgroups.
>>>
>>> There are some loose ends here that warrant further discussion:
>>> (1) A user-friendly interface for such limits.  Would a proportional
>>> weight for the cgroup that translates to an actual absolute limit be more suitable?
>>> (2) Memory mis-configurations can occur more easily, as the admin
>>> has a much larger number of limits spread across the
>>> cgroups to manage.  Over-restrictive limits can lead to under-utilized,
>>> wasted memory and hurt performance.
>>> (3) OOM behavior when a cgroup hits its limit.
>>>
> 
> This (numa based limits) is something I was pushing for but after
> discussing this internally with userspace controller devs, I have to
> back off from this position.
> 
> The main feedback I got was that setting one memory limit is already
> complicated and having to set/adjust these many limits would be
> horrifying.
> 
>>> 2.      Per memory tier limit and accounting for each cgroup.
>>> We can assign higher limits on memory in the better-performing
>>> memory tiers to higher-priority cgroups.  I previously
>>> prototyped a soft-limit-based implementation to demonstrate the
>>> tiered limit idea.
>>>
>>> There are also a number of issues here:
>>> (1)     The advantage is that we have fewer limits to deal with, which simplifies
>>> configuration. However, a number of people have raised doubts about
>>> whether we can really classify the NUMA
>>> nodes into memory tiers properly. There could still be significant performance
>>> differences between NUMA nodes even for the same kind of memory.
>>> We will also not have the fine-grained control and flexibility that come
>>> with a per NUMA node limit.
>>> (2)     Would a memory hierarchy defined by the promotion/demotion relationships
>>> between memory nodes be a viable approach for defining memory tiers?
>>>
>>> These issues related to the management of systems with multiple kinds of memory
>>> can be ironed out in this session.
>>
>> Thanks for suggesting this topic. I'm interested in the topic and
>> would like to attend.
>>
>> Other than the above points, I'm wondering whether we should discuss
>> "Migrate Pages in lieu of discard" as well? Dave Hansen is driving the
>> development and I have been involved in the early development and
>> review, but it seems there are still some open questions according to
>> the latest review feedback.
>>
>> Some other folks may be interested in this topic as well, so I've CC'ed them on
>> the thread.
>>
> 
> At the moment "personally" I am more inclined towards a passive
> approach towards the memcg accounting of memory tiers. By that I mean,
> let's start by providing a 'usage' interface and get more
> production/real-world data to motivate the 'limit' interfaces. (One
> minor reason is that defining the 'limit' interface will force us to
> make the decision on defining tiers i.e. numa or a set of numa or
> others).

Probably we could first start with accounting the memory used in each
NUMA node for a cgroup and exposing this information to user space.  
I think that is useful regardless.

There is still a question of whether we want to define a set of
NUMA nodes as a tier and extend the accounting and management to that
memory tier abstraction level.
 
> 
> IMHO we should focus more on the "aging" of the application memory and
> "migration/balance" between the tiers. I don't think the memory
> reclaim infrastructure is the right place for these operations
> (unevictable pages are ignored, and the ages are not accurate). What we need is
> proactive continuous aging and balancing. We need something like, with
> additions, Multi-gen LRUs or DAMON or page idle tracking for aging and
> a new mechanism for balancing which takes ages into account.

Multi-gen LRUs will be pretty useful to expose the page warmth in a NUMA
node and to target the right page to reclaim for a memcg. We will also need some
way to determine how many pages to target in each memcg for a reclaim.
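
One way to pick those per-memcg targets, as a sketch only: demote from each
memcg in proportion to how far it is over its top-tier allowance.  The
allowances, usage numbers and node-wide target in the Python snippet below are
invented purely to show the arithmetic; nothing here corresponds to an
existing interface.

# Pages each memcg currently has in the top tier, and a hypothetical allowance.
usage     = {"memcg_a": 300_000, "memcg_b": 120_000, "memcg_c": 50_000}
allowance = {"memcg_a": 200_000, "memcg_b": 100_000, "memcg_c": 80_000}

def demotion_targets(usage, allowance, node_target):
    """Split node_target pages to demote among memcgs over their allowance."""
    excess = {cg: max(0, usage[cg] - allowance[cg]) for cg in usage}
    total = sum(excess.values())
    if total == 0:
        return {cg: 0 for cg in usage}
    return {cg: min(e, node_target * e // total) for cg, e in excess.items()}

if __name__ == "__main__":
    for cg, n in demotion_targets(usage, allowance, node_target=60_000).items():
        print(f"{cg}: demote {n} pages")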

> 
> To give a more concrete example: Let's say we have a system with two
> memory tiers and multiple low and high priority jobs. For high
> priority jobs, set the allocation try list from high to low tier and
> for low priority jobs the reverse of that (I am not sure if we can do
> that out of the box with today's kernel). In the background we migrate
> cold memory down the tiers and hot memory in the reverse direction.
> 
> In this background mechanism we can enforce all different limiting
> policies like Yang's original high and low tier percentage or
> something like X% of accesses of high priority jobs should be from
> high tier. 

If I understand correctly, you would like the kernel to provide
an interface to expose performance information like
"X% of accesses of high priority jobs is from high tier",
and knobs for user space to tell kernel to re-balance pages on
a per job class (or cgroup) basis based on this information.
The page re-balancing will be initiated by user space rather than
by the kernel, similar to what Wei proposed.
 

> Basically I am saying until we find from production data
> that this background mechanism is not strong enough to enforce passive
> limits, we should delay the decision on limit interfaces.
>

Implementing hard limits on a per node basis does have a number of rough
edges.  We should probably first start with doing the proper accounting and
exposing the right performance information.


Tim



* Re: [LSF/MM TOPIC] Tiered memory accounting and management
  2021-06-18 22:11     ` Tim Chen
@ 2021-06-18 23:59       ` Shakeel Butt
  2021-06-19  0:56         ` Tim Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Shakeel Butt @ 2021-06-18 23:59 UTC (permalink / raw)
  To: Tim Chen
  Cc: Yang Shi, lsf-pc, Linux MM, Michal Hocko, Dan Williams,
	Dave Hansen, David Rientjes, Wei Xu, Greg Thelen

On Fri, Jun 18, 2021 at 3:11 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
> On 6/17/21 11:48 AM, Shakeel Butt wrote:
[...]
> >
> > At the moment "personally" I am more inclined towards a passive
> > approach towards the memcg accounting of memory tiers. By that I mean,
> > let's start by providing a 'usage' interface and get more
> > production/real-world data to motivate the 'limit' interfaces. (One
> > minor reason is that defining the 'limit' interface will force us to
> > make the decision on defining tiers i.e. numa or a set of numa or
> > others).
>
> Probably we could first start with accounting the memory used in each
> NUMA node for a cgroup and exposing this information to user space.
> I think that is useful regardless.
>

Is memory.numa_stat not good enough? This interface does miss
__GFP_ACCOUNT non-slab allocations, percpu and sock.
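
For reference, the per-node breakdown can be read out of cgroup v2's
memory.numa_stat with something like the short Python sketch below (it assumes
the usual /sys/fs/cgroup mount point and takes the cgroup directory as its
argument):

# memory.numa_stat lines look like:  anon N0=1048576 N1=0
def numa_stat(cgroup_path):
    """Return {stat_name: {node_id: bytes}} for one cgroup."""
    stats = {}
    with open(f"{cgroup_path}/memory.numa_stat") as f:
        for line in f:
            name, *per_node = line.split()
            stats[name] = {int(tok.split("=")[0][1:]): int(tok.split("=")[1])
                           for tok in per_node}
    return stats

if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        sys.exit("usage: numa_stat.py /sys/fs/cgroup/<cgroup>")
    for node, nbytes in numa_stat(sys.argv[1]).get("anon", {}).items():
        print(f"node {node}: {nbytes >> 20} MiB of anon memory")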

> There is still a question of whether we want to define a set of
> NUMA nodes as a tier and extend the accounting and management to that
> memory tier abstraction level.
>
[...]
> >
> > To give a more concrete example: Let's say we have a system with two
> > memory tiers and multiple low and high priority jobs. For high
> > priority jobs, set the allocation try list from high to low tier and
> > for low priority jobs the reverse of that (I am not sure if we can do
> > that out of the box with today's kernel). In the background we migrate
> > cold memory down the tiers and hot memory in the reverse direction.
> >
> > In this background mechanism we can enforce all different limiting
> > policies like Yang's original high and low tier percentage or
> > something like X% of accesses of high priority jobs should be from
> > high tier.
>
> If I understand correctly, you would like the kernel to provide
> an interface to expose performance information like
> "X% of accesses of high priority jobs is from high tier",

I think we can estimate "X% of accesses to high tier" using existing
perf/PMU counters. So, no new interface.

> and knobs for user space to tell kernel to re-balance pages on
> a per job class (or cgroup) basis based on this information.
> The page re-balancing will be initiated by user space rather than
> by the kernel, similar to what Wei proposed.

This is more open to discussion and we should brainstorm the pros and
cons of all proposed approaches.



* Re: [LSF/MM TOPIC] Tiered memory accounting and management
  2021-06-18 23:59       ` Shakeel Butt
@ 2021-06-19  0:56         ` Tim Chen
  2021-06-19  1:17           ` Shakeel Butt
  0 siblings, 1 reply; 9+ messages in thread
From: Tim Chen @ 2021-06-19  0:56 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yang Shi, lsf-pc, Linux MM, Michal Hocko, Dan Williams,
	Dave Hansen, David Rientjes, Wei Xu, Greg Thelen



On 6/18/21 4:59 PM, Shakeel Butt wrote:
> On Fri, Jun 18, 2021 at 3:11 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>
>>
>>
>> On 6/17/21 11:48 AM, Shakeel Butt wrote:
> [...]
>>>
>>> At the moment "personally" I am more inclined towards a passive
>>> approach towards the memcg accounting of memory tiers. By that I mean,
>>> let's start by providing a 'usage' interface and get more
>>> production/real-world data to motivate the 'limit' interfaces. (One
>>> minor reason is that defining the 'limit' interface will force us to
>>> make the decision on defining tiers i.e. numa or a set of numa or
>>> others).
>>
>> Probably we could first start with accounting the memory used in each
>> NUMA node for a cgroup and exposing this information to user space.
>> I think that is useful regardless.
>>
> 
> Is memory.numa_stat not good enough? 

Yeah, forgot numa_stat is already there.  Thanks for reminding me.

> This interface does miss
> __GFP_ACCOUNT non-slab allocations, percpu and sock.

numa_stat should be good enough for now.

> 
>> There is still a question of whether we want to define a set of
>> NUMA nodes as a tier and extend the accounting and management to that
>> memory tier abstraction level.
>>
> [...]
>>>
>>> To give a more concrete example: Let's say we have a system with two
>>> memory tiers and multiple low and high priority jobs. For high
>>> priority jobs, set the allocation try list from high to low tier and
>>> for low priority jobs the reverse of that (I am not sure if we can do
>>> that out of the box with today's kernel). In the background we migrate
>>> cold memory down the tiers and hot memory in the reverse direction.
>>>
>>> In this background mechanism we can enforce all different limiting
>>> policies like Yang's original high and low tier percentage or
>>> something like X% of accesses of high priority jobs should be from
>>> high tier.
>>
>> If I understand correctly, you would like the kernel to provide
>> an interface to expose performance information like
>> "X% of accesses of high priority jobs is from high tier",
> 
> I think we can estimate "X% of accesses to high tier" using existing
> perf/PMU counters. So, no new interface.

Using a perf counter will be okay for a user space daemon, but I
think there will be objections from people if the kernel
takes away a perf counter to collect perf data in the kernel.

Tim



* Re: [LSF/MM TOPIC] Tiered memory accounting and management
  2021-06-19  0:56         ` Tim Chen
@ 2021-06-19  1:17           ` Shakeel Butt
  0 siblings, 0 replies; 9+ messages in thread
From: Shakeel Butt @ 2021-06-19  1:17 UTC (permalink / raw)
  To: Tim Chen
  Cc: Yang Shi, lsf-pc, Linux MM, Michal Hocko, Dan Williams,
	Dave Hansen, David Rientjes, Wei Xu, Greg Thelen

On Fri, Jun 18, 2021 at 5:56 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
[...]
> >>
> >> If I understand correctly, you would like the kernel to provide
> >> an interface to expose performance information like
> >> "X% of accesses of high priority jobs is from high tier",
> >
> > I think we can estimate "X% of accesses to high tier" using existing
> > perf/PMU counters. So, no new interface.
>
> Using a perf counter will be okay for a user space daemon, but I
> think there will be objections from people if the kernel
> takes away a perf counter to collect perf data in the kernel.
>

This is one possible policy. I would not focus too much on it unless
someone says they want exactly that. In that case we can brainstorm
how to provide general infrastructure to enforce such policies.
Basically this is like an SLO whose violation triggers the balancing
(which can be done in user space or the kernel; that is a separate discussion).



* Re: [LSF/MM TOPIC] Tiered memory accounting and management
  2021-06-17 18:48   ` Shakeel Butt
  2021-06-18 22:11     ` Tim Chen
@ 2021-06-21 20:42     ` Yang Shi
  2021-06-21 21:23       ` Shakeel Butt
  1 sibling, 1 reply; 9+ messages in thread
From: Yang Shi @ 2021-06-21 20:42 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tim Chen, lsf-pc, Linux MM, Michal Hocko, Dan Williams,
	Dave Hansen, David Rientjes, Wei Xu, Greg Thelen

On Thu, Jun 17, 2021 at 11:49 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Thanks Yang for the CC.
>
> On Tue, Jun 15, 2021 at 5:17 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Mon, Jun 14, 2021 at 2:51 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > >
> > >
> > > From: Tim Chen <tim.c.chen@linux.intel.com>
> > >
> > > Tiered memory accounting and management
> > > ------------------------------------------------------------
> > > Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
> > > than others, but a byte of media has about the same cost whether it
> > > is close or far.  But, with new memory tiers such as High-Bandwidth
> > > Memory or Persistent Memory, there is a choice between fast/expensive
> > > and slow/cheap.  But, the current memory cgroups still live in the
> > > old model. There is only one set of limits, and it implies that all
> > > memory has the same cost.  We would like to extend memory cgroups to
> > > comprehend different memory tiers to give users a way to choose a mix
> > > between fast/expensive and slow/cheap.
> > >
> > > To manage such memory, we will need to account memory usage and
> > > impose limits for each kind of memory.
> > >
> > > There are a couple of approaches, listed below, that have been discussed
> > > previously for partitioning memory between cgroups.  We would like to
> > > use the LSF/MM session to come to a consensus on the approach to
> > > take.
> > >
> > > 1.      Per NUMA node limit and accounting for each cgroup.
> > > We can assign higher limits on better-performing memory nodes to higher-priority cgroups.
> > >
> > > There are some loose ends here that warrant further discussion:
> > > (1) A user-friendly interface for such limits.  Would a proportional
> > > weight for the cgroup that translates to an actual absolute limit be more suitable?
> > > (2) Memory mis-configurations can occur more easily, as the admin
> > > has a much larger number of limits spread across the
> > > cgroups to manage.  Over-restrictive limits can lead to under-utilized,
> > > wasted memory and hurt performance.
> > > (3) OOM behavior when a cgroup hits its limit.
> > >
>
> This (numa based limits) is something I was pushing for but after
> discussing this internally with userspace controller devs, I have to
> back off from this position.
>
> The main feedback I got was that setting one memory limit is already
> complicated and having to set/adjust these many limits would be
> horrifying.

Yes, that is also what I heard.

>
> > > 2.      Per memory tier limit and accounting for each cgroup.
> > > We can assign higher limits on memory in the better-performing
> > > memory tiers to higher-priority cgroups.  I previously
> > > prototyped a soft-limit-based implementation to demonstrate the
> > > tiered limit idea.
> > >
> > > There are also a number of issues here:
> > > (1)     The advantage is that we have fewer limits to deal with, which simplifies
> > > configuration. However, a number of people have raised doubts about
> > > whether we can really classify the NUMA
> > > nodes into memory tiers properly. There could still be significant performance
> > > differences between NUMA nodes even for the same kind of memory.
> > > We will also not have the fine-grained control and flexibility that come
> > > with a per NUMA node limit.
> > > (2)     Would a memory hierarchy defined by the promotion/demotion relationships
> > > between memory nodes be a viable approach for defining memory tiers?
> > >
> > > These issues related to the management of systems with multiple kinds of memory
> > > can be ironed out in this session.
> >
> > Thanks for suggesting this topic. I'm interested in the topic and
> > would like to attend.
> >
> > Other than the above points, I'm wondering whether we should discuss
> > "Migrate Pages in lieu of discard" as well? Dave Hansen is driving the
> > development and I have been involved in the early development and
> > review, but it seems there are still some open questions according to
> > the latest review feedback.
> >
> > Some other folks may be interested in this topic as well, so I've CC'ed them on
> > the thread.
> >
>
> At the moment "personally" I am more inclined towards a passive
> approach towards the memcg accounting of memory tiers. By that I mean,
> let's start by providing a 'usage' interface and get more
> production/real-world data to motivate the 'limit' interfaces. (One
> minor reason is that defining the 'limit' interface will force us to
> make the decision on defining tiers i.e. numa or a set of numa or
> others).
>
> IMHO we should focus more on the "aging" of the application memory and
> "migration/balance" between the tiers. I don't think the memory
> reclaim infrastructure is the right place for these operations
> (unevictable pages are ignored, and the ages are not accurate). What we need is

Why are unevictable pages a problem? I don't get why you have to demote
unevictable pages. If you do care which nodes the memory will be
mlock'ed on, don't you have to move the memory to the target nodes
before mlocking it?

> proactive continuous aging and balancing. We need something like, with
> additions, Multi-gen LRUs or DAMON or page idle tracking for aging and
> a new mechanism for balancing which takes ages into account.

I agree a better balance could be reached with more accurate aging. It
is a more general problem, not one specific to tiered memory.

>
> To give a more concrete example: Let's say we have a system with two
> memory tiers and multiple low and high priority jobs. For high
> priority jobs, set the allocation try list from high to low tier and
> for low priority jobs the reverse of that (I am not sure if we can do
> that out of the box with today's kernel). In the background we migrate

AFAICT, we don't. With the current APIs, you can just
bind to a set of nodes, but the fallback order only goes one way.

> cold memory down the tiers and hot memory in the reverse direction.
>
> In this background mechanism we can enforce all different limiting
> policies like Yang's original high and low tier percentage or
> something like X% of accesses of high priority jobs should be from
> high tier. Basically I am saying until we find from production data
> that this background mechanism is not strong enough to enforce passive
> limits, we should delay the decision on limit interfaces.



* Re: [LSF/MM TOPIC] Tiered memory accounting and management
  2021-06-21 20:42     ` Yang Shi
@ 2021-06-21 21:23       ` Shakeel Butt
  0 siblings, 0 replies; 9+ messages in thread
From: Shakeel Butt @ 2021-06-21 21:23 UTC (permalink / raw)
  To: Yang Shi
  Cc: Tim Chen, lsf-pc, Linux MM, Michal Hocko, Dan Williams,
	Dave Hansen, David Rientjes, Wei Xu, Greg Thelen

On Mon, Jun 21, 2021 at 1:43 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Jun 17, 2021 at 11:49 AM Shakeel Butt <shakeelb@google.com> wrote:
[...]
> >
> > IMHO we should focus more on the "aging" of the application memory and
> > "migration/balance" between the tiers. I don't think the memory
> > reclaim infrastructure is the right place for these operations
> > (unevictable pages are ignored, and the ages are not accurate). What we need is
>
> Why are unevictable pages a problem? I don't get why you have to demote
> unevictable pages. If you do care which nodes the memory will be
> mlock'ed on, don't you have to move the memory to the target nodes
> before mlocking it?
>

I think we want the ability to balance the memory (hot in the higher tier
and cold in the lower tier) irrespective of whether it is evictable or not.
Similarly, we want aging information for both evictable and unevictable
memory. If we depend on the reclaim infrastructure for demotion, then
cold unevictable memory may get stuck in the higher tier, and we will have no
aging information for unevictable memory.

> > proactive continuous aging and balancing. We need something like, with
> > additions, Multi-gen LRUs or DAMON or page idle tracking for aging and
> > a new mechanism for balancing which takes ages into account.
>
> I agree a better balance could be reached with more accurate aging. It
> is a more general problem, not one specific to tiered memory.
>

I agree and proactive reclaim is the other use-case.


