* [LSF/MM/BPF TOPIC] Using hardware counters to determine hot/cold pages
From: Aneesh Kumar K V @ 2023-02-17 11:58 UTC
  To: lsf-pc, Linux MM; +Cc: Yu Zhao, Dave Hansen, Johannes Weiner

The PowerPC architecture (POWER10) supports a Hot/Cold page tracking
facility that provides access counts and access affinity details at a
configurable page size granularity [1]. I have been looking at using
these counters in different areas of the kernel, such as:

1) Page reclaim/demotion
2) THP utilization
3) Page promotion.

I have done some MGLRU integration and would like to discuss the
observations with the rest of the community. It is still not clear what
the best ways are to integrate these hardware counters into the Linux
kernel. Attached is the performance graph showing how the mongodb/ycsb
benchmark performs when using hardware counters with MGLRU aging. An
early RFC version of the code can be found at:
https://github.com/kvaneesh/linux/commit/b472e2c8080823bb4114c286270aea3e18ffe221
I also expect we can get some numbers on THP usage before the
conference.


The X axis is the amount of memory that I am removing from the system
to force more memory reclaim. The system has 50GB of memory on a single
NUMA node with 64 CPUs, and the workload is a 40GB database with a 40GB
cache configuration.


[1]
https://hc32.hotchips.org/assets/program/conference/day1/HotChips2020_Server_Processors_IBM_Starke_POWER10_v33.pdf

[-- Attachment #2: mongodb-perf-lsf-mm.png --]
[-- Type: image/png, Size: 108289 bytes --]

* Re: [LSF/MM/BPF TOPIC] Using hardware counters to determine hot/cold pages
From: SeongJae Park @ 2023-02-17 16:42 UTC
  To: Aneesh Kumar K V
  Cc: lsf-pc, Linux MM, Yu Zhao, Dave Hansen, Johannes Weiner, damon

Hi Aneesh,

On Fri, 17 Feb 2023 17:28:09 +0530 Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:

> PowerPC architecture (POWER10) supports a Hot/Cold page tracking
> facility that provides access counter and access affinity details at
> configurable page size granularity [1]. I have been looking at using
> this counter in different areas of the kernel such as
> 
> 1) Page reclaim/demotion
> 2) THP utilization
> 3) Page promotion.
> 
> I have done some MGLRU integration and would like to discuss the
> observation with the rest of the community. It is still not clear what
> are the best ways to integrate these hardware counters in the Linux
> kernel.

Sounds very interesting.  I think DAMON might be another option, because it
is designed to be easy to extend with various sources of access
information[1], and provides an abstraction layer for access temperature-based
memory management[2], namely Data Access Monitoring-based Operation Schemes
(DAMOS).

> Attached is the performance graph showing how the mongodb/ycsb
> benchmark performs when using hardware counters with MGLRU aging. An
> early RFC version of the code can be found at
> https://github.com/kvaneesh/linux/commit/b472e2c8080823bb4114c286270aea3e18ffe221
> . I also expect we can get some numbers w.r.t THP usage before the
> conference.

I have also experimented with a DAMON-based THP optimization[3], which showed
interesting results.

I hope to discuss this with you at LSF/MM.  FYI, I also proposed an LSF/MM
topic for DAMON[4].

[1] https://docs.kernel.org/mm/damon/design.html#configurable-layers
[2] https://docs.kernel.org/mm/damon/api.html#c.damos
[3] https://www.amazon.science/publications/daos-data-access-aware-operating-system
[4] https://lore.kernel.org/damon/20230214003328.55285-1-sj@kernel.org/


Thanks,
SJ

> 
> 
> X axis is the amount of memory that I am removing from the system so
> that I can force more memory reclaims. The total memory available is
> 50GB/single NUMA node/64 CPUs,40GB database with 40GB cache
> configuration.
> 
> 
> [1]
> https://hc32.hotchips.org/assets/program/conference/day1/HotChips2020_Server_Processors_IBM_Starke_POWER10_v33.pdf
> 

* Re: [LSF/MM/BPF TOPIC] Using hardware counters to determine hot/cold pages
From: Matthew Wilcox @ 2023-02-17 16:53 UTC
  To: Aneesh Kumar K V; +Cc: lsf-pc, Linux MM, Yu Zhao, Dave Hansen, Johannes Weiner

On Fri, Feb 17, 2023 at 05:28:09PM +0530, Aneesh Kumar K V wrote:
> PowerPC architecture (POWER10) supports a Hot/Cold page tracking
> facility that provides access counter and access affinity details at
> configurable page size granularity [1]. I have been looking at using

Does that advert contain any more information about this feature than:

	Hot/Cold page tracking | Recording for memory management

because I'd like to understand what its limitations are -- can
it be a per-VMA option, for example?  Or is it set at bootup like
CONFIG_PAGE_SIZE?

For file-backed memory, the page cache will use variable sized
folios, depending on what it determines to be a useful granularity.
I'm _expecting_ something of the same sort for anonymous memory, although
maybe we'll make that determination on a per-VMA basis and make all
folios within a VMA the same size.


* Re: [LSF/MM/BPF TOPIC] Using hardware counters to determine hot/cold pages
From: Yang Shi @ 2023-02-17 22:00 UTC
  To: Aneesh Kumar K V; +Cc: lsf-pc, Linux MM, Yu Zhao, Dave Hansen, Johannes Weiner

On Fri, Feb 17, 2023 at 3:58 AM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> PowerPC architecture (POWER10) supports a Hot/Cold page tracking
> facility that provides access counter and access affinity details at
> configurable page size granularity [1]. I have been looking at using
> this counter in different areas of the kernel such as
>
> 1) Page reclaim/demotion
> 2) THP utilization
> 3) Page promotion.

Not sure whether you are aware of this patchset:
https://lore.kernel.org/linux-mm/20230208073533.715-1-bharata@amd.com/

ARM64 has SPE which provides similar functionality. So I hope a common
framework could be provided to hide the hardware details.

>
> I have done some MGLRU integration and would like to discuss the
> observation with the rest of the community. It is still not clear what
> are the best ways to integrate these hardware counters in the Linux
> kernel. Attached is the performance graph showing how the mongodb/ycsb
> benchmark performs when using hardware counters with MGLRU aging. An
> early RFC version of the code can be found at
> https://github.com/kvaneesh/linux/commit/b472e2c8080823bb4114c286270aea3e18ffe221
> . I also expect we can get some numbers w.r.t THP usage before the
> conference.
>
>
> X axis is the amount of memory that I am removing from the system so
> that I can force more memory reclaims. The total memory available is
> 50GB/single NUMA node/64 CPUs,40GB database with 40GB cache
> configuration.
>
>
> [1]
> https://hc32.hotchips.org/assets/program/conference/day1/HotChips2020_Server_Processors_IBM_Starke_POWER10_v33.pdf


* Re: [LSF/MM/BPF TOPIC] Using hardware counters to determine hot/cold pages
From: Aneesh Kumar K.V @ 2023-02-19 14:29 UTC
  To: SeongJae Park
  Cc: lsf-pc, Linux MM, Yu Zhao, Dave Hansen, Johannes Weiner, damon

SeongJae Park <sj@kernel.org> writes:

> Hi Aneesh,
>
> On Fri, 17 Feb 2023 17:28:09 +0530 Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
>
>> PowerPC architecture (POWER10) supports a Hot/Cold page tracking
>> facility that provides access counter and access affinity details at
>> configurable page size granularity [1]. I have been looking at using
>> this counter in different areas of the kernel such as
>> 
>> 1) Page reclaim/demotion
>> 2) THP utilization
>> 3) Page promotion.
>> 
>> I have done some MGLRU integration and would like to discuss the
>> observation with the rest of the community. It is still not clear what
>> are the best ways to integrate these hardware counters in the Linux
>> kernel.
>
> Sounds very interesting.  I think DAMON might be another option, because it
> is designed to be easy to extend with various sources of access
> information[1], and provides an abstraction layer for access temperature-based
> memory management[2], namely Data Access Monitoring-based Operation Schemes
> (DAMOS).
>
>> Attached is the performance graph showing how the mongodb/ycsb
>> benchmark performs when using hardware counters with MGLRU aging. An
>> early RFC version of the code can be found at
>> https://github.com/kvaneesh/linux/commit/b472e2c8080823bb4114c286270aea3e18ffe221
>> . I also expect we can get some numbers w.r.t THP usage before the
>> conference.
>
> I have also experimented with a DAMON-based THP optimization[3], which showed
> interesting results.
>
> I hope to discuss this with you at LSF/MM.  FYI, I also proposed an LSF/MM
> topic for DAMON[4].
>
> [1] https://docs.kernel.org/mm/damon/design.html#configurable-layers
> [2] https://docs.kernel.org/mm/damon/api.html#c.damos
> [3] https://www.amazon.science/publications/daos-data-access-aware-operating-system
> [4] https://lore.kernel.org/damon/20230214003328.55285-1-sj@kernel.org/
>
>

The hardware counters supported on POWER10 are based on physical
addresses. The hardware facility counts accesses across a physical
address range, and there is a counter for each page that gives the
access count and also information about which node accessed the page.

I haven't spent much time studying DAMON, so I might be wrong here. I
avoided using DAMON for the POC because my goal was to evaluate how the
hardware counters compare against the PTE reference bit, and I was not
sure I could evaluate that using the DAMON action facility.

I do agree that we could add a layer similar to DAMON_PADDR and expose
the details to userspace. But I was not sure we could take action based
on that. In most cases what I wanted was to move the coldest page in the
NUMA node to swap. That is a decision based on relative hotness, rather
than moving any page with a hotness value below 10 to swap, even though
we could probably find a way to make the latter approximate the former.
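
To illustrate the distinction between relative and absolute hotness, here
is a minimal Python sketch. It is not taken from the RFC code; the
`access_counts` mapping (page frame number to access count) and both
function names are hypothetical and exist only for this example.

```python
import heapq


def coldest_pages(access_counts, n):
    """Relative selection: the n pages with the lowest access counts.

    Always yields candidates (if any pages exist), whatever the
    absolute values of the counters are.
    """
    return heapq.nsmallest(n, access_counts, key=access_counts.get)


def pages_below_threshold(access_counts, threshold):
    """Absolute selection: every page whose count is below a fixed cutoff.

    May yield nothing at all when the whole node is busy.
    """
    return sorted(pfn for pfn, count in access_counts.items()
                  if count < threshold)


# Hypothetical per-page counters for one NUMA node.
counts = {0x100: 250, 0x101: 40, 0x102: 12, 0x103: 900, 0x104: 15}

relative = coldest_pages(counts, 2)          # picks pfns 0x102 and 0x104
absolute = pages_below_threshold(counts, 10)  # empty: no page is below 10
```

With these sample counters the fixed cutoff of 10 finds nothing to
reclaim, while the relative ranking still identifies the two coldest
pages, which is the behaviour wanted when reclaim has to make progress.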


I will look at DAMON and see if that is the best framework for things
like this.

-aneesh

* Re: [LSF/MM/BPF TOPIC] Using hardware counters to determine hot/cold pages
From: Aneesh Kumar K.V @ 2023-02-19 14:43 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, Linux MM, Yu Zhao, Dave Hansen, Johannes Weiner

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Feb 17, 2023 at 05:28:09PM +0530, Aneesh Kumar K V wrote:
>> PowerPC architecture (POWER10) supports a Hot/Cold page tracking
>> facility that provides access counter and access affinity details at
>> configurable page size granularity [1]. I have been looking at using
>
> Does that advert contain any more information about this feature than:
>
> 	Hot/Cold page tracking | Recording for memory management
>

I will work with the hardware team to see if I can get a writeup done
before the conference. But I am also interested in discussing things
like who bears the cost of acting on hotness. Since a facility like this
operates on physical address ranges, we may mostly be doing this outside
process context. For example, I could see the possibility of a kpromoted
thread that looks at the youngest generation in MGLRU and, based on
relative hotness, moves hot pages to the NUMA node from which they are
frequently accessed. Should kpromoted do the migration itself? Or should
it mark the pages as ready for migration (something like prot NUMA) and
let the task migrate the page on its next access?
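
Whichever context ends up doing the migration, the destination has to be
derived from the per-node access information. A minimal Python sketch of
that decision follows. The `accesses_by_node` mapping, the function name,
and the `min_share` cutoff are all assumptions made for illustration; the
actual format of the POWER10 affinity data is not described in this
thread.

```python
def pick_target_node(accesses_by_node, current_node, min_share=0.5):
    """Choose a NUMA node to promote a page to.

    accesses_by_node: hypothetical {node_id: access_count} for one page.
    The page moves only if a single remote node accounts for at least
    min_share of all recorded accesses; otherwise it stays where it is.
    """
    total = sum(accesses_by_node.values())
    if total == 0:
        return current_node  # no information, do nothing
    busiest = max(accesses_by_node, key=accesses_by_node.get)
    if busiest != current_node and accesses_by_node[busiest] / total >= min_share:
        return busiest
    return current_node


# Node 1 dominates accesses to a page that lives on node 0: move it.
dominant = pick_target_node({0: 10, 1: 90}, current_node=0)

# Accesses are spread out with no clear winner: leave the page alone.
spread = pick_target_node({0: 40, 1: 35, 2: 25}, current_node=1)
```

The same helper could be called either from a background thread that
migrates directly or from a fault path that acts on pages marked earlier;
the sketch does not take a position on which is preferable.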


Another challenge I ran into is determining relative hotness. In most
cases what we need is relative hotness, not the absolute access count of
a page. I also noticed that with the mongodb test, the performance
varies a lot based on how we determine relative hotness.


> because I'd like to understand what its limitations are -- can
> it be a per-VMA option, for example?  Or is it set at bootup like
> CONFIG_PAGE_SIZE?

The hardware counters supported on POWER10 are based on physical
addresses. The hardware facility counts accesses across a physical
address range, and there is a counter for each page that gives the
access count and also information about which node accessed the page.
The page size is configurable, and in the POC I used CONFIG_PAGE_SIZE.
There is overhead in enabling/disabling the facility, and I haven't
looked at doing that at something like context-switch granularity. Also,
it monitors a physical address range, and I am not sure how we can make
that work for a VMA range or a task address space.
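
As a rough illustration of why the tracking is tied to physical ranges
rather than VMAs, the following Python sketch maps a physical address to
a counter slot. The flat one-counter-per-page layout, the function names,
and the sample addresses are assumptions for this example, not a
description of the actual hardware interface.

```python
def nr_counters(range_size, page_shift):
    """Number of counters needed to cover a monitored physical range."""
    return range_size >> page_shift


def counter_index(paddr, range_start, range_size, page_shift):
    """Index of the counter covering physical address paddr.

    page_shift is log2 of the configured tracking page size
    (16 for 64KB pages, 12 for 4KB pages).
    """
    if not range_start <= paddr < range_start + range_size:
        raise ValueError("address is outside the monitored range")
    return (paddr - range_start) >> page_shift


# Hypothetical 1GB monitored range starting at 4GB.
start = 0x1_0000_0000
size = 1 << 30
addr = start + 3 * 65536 + 100

idx_64k = counter_index(addr, start, size, 16)  # 3
idx_4k = counter_index(addr, start, size, 12)   # 48
```

Nothing in this lookup knows about a task or a virtual address, so
restricting the counters to one VMA would need an extra translation step
on top of it.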

>
> For file-backed memory, the page cache will use variable sized
> folios, depending on what it determines to be a useful granularity.
> I'm _expecting_ something of the same sort for anonymous memory, although
> maybe we'll make that determination on a per-VMA basis and make all
> folios within a VMA the same size.

-aneesh


* Re: [LSF/MM/BPF TOPIC] Using hardware counters to determine hot/cold pages
From: Aneesh Kumar K.V @ 2023-02-19 14:45 UTC
  To: Yang Shi; +Cc: lsf-pc, Linux MM, Yu Zhao, Dave Hansen, Johannes Weiner

Yang Shi <shy828301@gmail.com> writes:

> On Fri, Feb 17, 2023 at 3:58 AM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> PowerPC architecture (POWER10) supports a Hot/Cold page tracking
>> facility that provides access counter and access affinity details at
>> configurable page size granularity [1]. I have been looking at using
>> this counter in different areas of the kernel such as
>>
>> 1) Page reclaim/demotion
>> 2) THP utilization
>> 3) Page promotion.
>
> Not sure whether you are aware of this patchset:
> https://lore.kernel.org/linux-mm/20230208073533.715-1-bharata@amd.com/
>
> ARM64 has SPE which provides similar functionality. So I hope a common
> framework could be provided to hide the hardware details.

I will look at this discussion and see if there are some details I can reuse.

-aneesh


* Re: [LSF/MM/BPF TOPIC] Using hardware counters to determine hot/cold pages
From: SeongJae Park @ 2023-02-19 20:31 UTC
  To: Aneesh Kumar K.V
  Cc: SeongJae Park, lsf-pc, Linux MM, Yu Zhao, Dave Hansen,
	Johannes Weiner, damon

Hi Aneesh,

On Sun, 19 Feb 2023 19:59:47 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> SeongJae Park <sj@kernel.org> writes:
> 
> > Hi Aneesh,
> >
> > On Fri, 17 Feb 2023 17:28:09 +0530 Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
> >
> >> PowerPC architecture (POWER10) supports a Hot/Cold page tracking
> >> facility that provides access counter and access affinity details at
> >> configurable page size granularity [1]. I have been looking at using
> >> this counter in different areas of the kernel such as
> >> 
> >> 1) Page reclaim/demotion
> >> 2) THP utilization
> >> 3) Page promotion.
> >> 
> >> I have done some MGLRU integration and would like to discuss the
> >> observation with the rest of the community. It is still not clear what
> >> are the best ways to integrate these hardware counters in the Linux
> >> kernel.
> >
> > Sounds very interesting.  I think DAMON might be another option, because it
> > is designed to be easy to extend with various sources of access
> > information[1], and provides an abstraction layer for access temperature-based
> > memory management[2], namely Data Access Monitoring-based Operation Schemes
> > (DAMOS).
> >
> >> Attached is the performance graph showing how the mongodb/ycsb
> >> benchmark performs when using hardware counters with MGLRU aging. An
> >> early RFC version of the code can be found at
> >> https://github.com/kvaneesh/linux/commit/b472e2c8080823bb4114c286270aea3e18ffe221
> >> . I also expect we can get some numbers w.r.t THP usage before the
> >> conference.
> >
> > I have also experimented with a DAMON-based THP optimization[3], which showed
> > interesting results.
> >
> > I hope to discuss this with you at LSF/MM.  FYI, I also proposed an LSF/MM
> > topic for DAMON[4].
> >
> > [1] https://docs.kernel.org/mm/damon/design.html#configurable-layers
> > [2] https://docs.kernel.org/mm/damon/api.html#c.damos
> > [3] https://www.amazon.science/publications/daos-data-access-aware-operating-system
> > [4] https://lore.kernel.org/damon/20230214003328.55285-1-sj@kernel.org/
> >
> >
> 
> The hardware counters that are supported in the case of POWER10 are
> based on physical addresses. The hardware facility will count the access
> across a physical address range and there is a counter for each page
> that gives the access count and also information about which node did
> access the page.
> 
> I haven't spent much time studying DAMON so I might be wrong here. The
> reason I avoided using DAMON for the POC was because my goal was to
> evaluate how the hardware counters measured against the pte reference
> bit and I was not sure I could evaluate that using the DAMON action
> facility.
> 
> I do agree that we could add a layer similar to DAMON_PADDR and expose
> the details to userspace. But I was not sure we can take action based on
> that. In most cases what I wanted was to move the coldest page in the
> Numa node to swap. So that is relative hotness rather than moving a page
> that got a hotness value less than 10 to swap even though we can figure
> out a way to make the latter similar to the former.

I think you could use DAMOS quota[1].  Under the quota, DAMOS prioritizes
regions based on the access pattern, and applies the action to higher-priority
regions, so you can do relative coldness-based reclamation, like DAMON_RECLAIM
does[2].

The interface might not be very efficient for your specific case, though.  I
would like to learn about such cases so I can improve the interface or
implement new features.

[1] https://docs.kernel.org/mm/damon/api.html#c.damos_quota
[2] https://docs.kernel.org/admin-guide/mm/damon/reclaim.html#quota-sz
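
The idea of applying an action to the highest-priority regions until a
quota is used up can be sketched in Python as follows. This is a
simplified illustration only: the `Region` type, the ranking key, and the
`select_under_quota` function are assumptions made here, and DAMON's real
prioritization is more involved than a plain sort (see the linked
documentation).

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int        # start address of the region
    end: int          # end address (exclusive)
    nr_accesses: int  # observed access frequency
    age: int          # how long the access pattern has been stable

    @property
    def size(self):
        return self.end - self.start


def select_under_quota(regions, quota_bytes):
    """Pick the coldest regions whose total size fits within quota_bytes.

    Coldness here means fewest accesses first, with older regions
    preferred among equals. Selection stops at the first region that
    would exceed the quota, so the order stays strictly coldest-first.
    """
    ranked = sorted(regions, key=lambda r: (r.nr_accesses, -r.age))
    chosen, used = [], 0
    for region in ranked:
        if used + region.size > quota_bytes:
            break
        chosen.append(region)
        used += region.size
    return chosen


regions = [
    Region(0, 4096, nr_accesses=5, age=1),
    Region(4096, 12288, nr_accesses=0, age=10),
    Region(12288, 16384, nr_accesses=0, age=3),
    Region(16384, 20480, nr_accesses=1, age=2),
]

# With a 12KB quota, only the two never-accessed regions are selected.
selected = select_under_quota(regions, quota_bytes=12288)
```

No fixed hotness threshold appears anywhere in the selection: the quota
alone determines how far down the coldness ranking the action reaches,
which is what makes the result relative rather than absolute.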

> 
> 
> I will look at DAMON and see if that is the best framework for things
> like this.

Great, please feel free to reach out to me if you have any questions or need
help.


Thanks,
SJ

> 
> -aneesh
