[RFC] mm: drop mark_page_accessed from the unmap path

Message ID 20190809124305.GQ18351@dhcp22.suse.cz
State New

Commit Message

Michal Hocko Aug. 9, 2019, 12:43 p.m. UTC
On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
[...]
> > > As Nick mentioned in the description, without mark_page_accessed in
> > > the zapping part, repeated mmap + touch + munmap never activated the page
> > > while several read(2) calls easily promoted it.
> > 
> > And is this really a problem? If we refault the same page then the
> > refaults detection should catch it no? In other words is the above still
> > a problem these days?
> 
> I admit we have not been fair to them, because read(2) syscall pages are
> easily promoted regardless of zap timing, unlike mmap-based pages.
> 
> However, if we remove the mark_page_accessed in zap_pte_range, it
> would make them even more unfair in that read(2)-accessed pages are easily
> promoted while mmap-based pages have to go through a refault to be promoted.

I have a really hard time following why special handling at unmap time
makes the overall state more reasonable.

Anyway, let me throw the patch for further discussion. Nick, Mel,
Johannes what do you think?

From 3821c2e66347a2141358cabdc6224d9990276fec Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 9 Aug 2019 14:29:59 +0200
Subject: [PATCH] mm: drop mark_page_accessed from the unmap path

Minchan has noticed that mark_page_accessed can take quite some time
during unmap:
: I had time to benchmark it by adding some trace_printk hooks between
: pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The test
: device is a 2018 premium mobile device.
:
: I can rather easily see a 2ms delay to release 2M (i.e. 512 pages) when
: the task runs on a little core, even without any IPI or LRU lock
: contention. It's already too heavy.
:
: If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
: so most of the overhead (about 0.7ms) comes from activate_page via
: mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
: accumulate up to several ms.

bf3f3bc5e734 ("mm: don't mark_page_accessed in fault path") replaced
SetPageReferenced with mark_page_accessed, arguing that the former is not
sufficient once mark_page_accessed is removed from the fault path because
it doesn't promote the page to the active list. It is true that a page
mapped by a single process might not get promoted even when referenced if
reclaim checks it after the unmap, but does that matter that much? Can we
consider the page hot if there are no other users? Moreover, we have had
workingset detection in place since then, so a subsequent refault would
activate the page if it was really hot.

Drop the expensive mark_page_accessed and restore SetPageReferenced to
transfer the reference information into the struct page for now, to
reduce the unmap overhead. Should we find workloads that noticeably
depend on this behavior, we should find a way to make mark_page_accessed
less expensive.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
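
The diff body is not included in this excerpt. Based on the changelog (and on bf3f3bc5e734, which this effectively reverts), the single hunk in zap_pte_range presumably boils down to something like the following; this is a reconstruction for the reader, not the posted patch:

```diff
--- a/mm/memory.c
+++ b/mm/memory.c
@@ zap_pte_range @@
 			if (pte_young(ptent) &&
 			    likely(!(vma->vm_flags & VM_SEQ_READ)))
-				mark_page_accessed(page);
+				SetPageReferenced(page);
```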

Comments

Mel Gorman Aug. 9, 2019, 5:57 p.m. UTC | #1
On Fri, Aug 09, 2019 at 02:43:24PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> > On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> [...]
> > > > As Nick mentioned in the description, without mark_page_accessed in
> > > > zapping part, repeated mmap + touch + munmap never activated the page
> > > > while several read(2) calls easily promote it.
> > > 
> > > And is this really a problem? If we refault the same page then the
> > > refaults detection should catch it no? In other words is the above still
> > > a problem these days?
> > 
> > I admit we have been not fair for them because read(2) syscall pages are
> > easily promoted regardless of zap timing unlike mmap-based pages.
> > 
> > However, if we remove the mark_page_accessed in the zap_pte_range, it
> > would make them more unfair in that read(2)-accessed pages are easily
> > promoted while mmap-based page should go through refault to be promoted.
> 
> I have really hard time to follow why an unmap special handling is
> making the overall state more reasonable.
> 
> Anyway, let me throw the patch for further discussion. Nick, Mel,
> Johannes what do you think?
> 

I won't be able to answer follow-ups to this for a while but here is some
superficial thinking.

Minimally, you should test PageReferenced before setting it, like
mark_page_accessed does, to avoid unnecessary atomics. I know it wasn't
done that way before, but there is no harm in addressing it now.

workingset_activation is necessarily expensive. It could speculatively
look up the memcg outside the RCU read lock and only acquire it if there
is something interesting to look up. Probably not much help though.

Note that losing the potential workingset_activation from the patch
may have consequences if we are relying on refaults to fix this up. I'm
undecided as to what degree it matters.

That said, I do agree that the mark_page_accessed on page zapping may
be overkill given that it can be a very expensive call if the page
gets activated, and it's potentially being called in the zap path at a
high frequency. It's also not a function that is particularly easy to
optimise if you want to cover all the cases that matter. It really would
be preferable to have knowledge of a workload that really cares about
the activations from mmap/touch/munmap.

mark_page_accessed is a hint; it's known that there are gaps with it,
so we shouldn't pay too much of a cost for information that only
might be useful. If the system is under no memory pressure because the
workloads are tuned to fit in memory (e.g. a database using direct IO)
then mark_page_accessed is pure cost. We could avoid marking the page
accessed entirely under PF_EXITING, given that an access from an exiting
task is not a strong indication that the page is of any interest. Even if
the page is heavily shared and one user exits, the other users will keep
it referenced and prevent reclaim anyway. The benefit is too marginal,
though.

Given the definite cost of mark_page_accessed in this path, and the main
corner case being tasks that access pages via mmap/touch/munmap (which is
insanely expensive if done at high frequency), I think it's reasonable to
rely on SetPageReferenced giving the page another lap of the LRU in most
cases (the obvious exception being CMA forcing reclaim). That opinion
might change if there is a known example of a realistic workload that
would suffer from the lack of explicit activations in the teardown context.
Johannes Weiner Aug. 9, 2019, 6:34 p.m. UTC | #2
On Fri, Aug 09, 2019 at 02:43:24PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> > On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> [...]
> > > > As Nick mentioned in the description, without mark_page_accessed in
> > > > zapping part, repeated mmap + touch + munmap never activated the page
> > > > while several read(2) calls easily promote it.
> > > 
> > > And is this really a problem? If we refault the same page then the
> > > refaults detection should catch it no? In other words is the above still
> > > a problem these days?
> > 
> > I admit we have been not fair for them because read(2) syscall pages are
> > easily promoted regardless of zap timing unlike mmap-based pages.
> > 
> > However, if we remove the mark_page_accessed in the zap_pte_range, it
> > would make them more unfair in that read(2)-accessed pages are easily
> > promoted while mmap-based page should go through refault to be promoted.
> 
> I have really hard time to follow why an unmap special handling is
> making the overall state more reasonable.
> 
> Anyway, let me throw the patch for further discussion. Nick, Mel,
> Johannes what do you think?
> 
> From 3821c2e66347a2141358cabdc6224d9990276fec Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 9 Aug 2019 14:29:59 +0200
> Subject: [PATCH] mm: drop mark_page_accessed from the unmap path
> 
> Minchan has noticed that mark_page_accessed can take quite some time
> during unmap:
> : I had a time to benchmark it via adding some trace_printk hooks between
> : pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> : device is 2018 premium mobile device.
> :
> : I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> : task runs on little core even though it doesn't have any IPI and LRU
> : lock contention. It's already too heavy.
> :
> : If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> : so most of overhead(about 0.7ms) comes from activate_page via
> : mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could
> : accumulate up to several ms.
> 
> bf3f3bc5e734 ("mm: don't mark_page_accessed in fault path") has replaced
> SetPageReferenced by mark_page_accessed arguing that the former is not
> sufficient when mark_page_accessed is removed from the fault path
> because it doesn't promote page to the active list. It is true that a
> page that is mapped by a single process might not get promoted even when
> referenced if the reclaim checks it after the unmap but does that matter
> that much? Can we consider the page hot if there are no other
> users? Moreover we do have workingset detection in place since then and
> so a next refault would activate the page if it was really hot one.

I do think the pages can be very hot. Think of short-lived executables
and their libraries, like shell commands. When they run a few times or
periodically, they should be promoted to the active list and not have
to compete with streaming IO on the inactive list - the PG_referenced
flag doesn't really help them there, see page_check_references().

Maybe the refaults will be fine - but latency expectations around
mapped page cache certainly are a lot higher than unmapped cache.

So I'm a bit reluctant about this patch. If Minchan can be happy with
the lock batching, I'd prefer that.
Michal Hocko Aug. 12, 2019, 8:09 a.m. UTC | #3
On Fri 09-08-19 14:34:24, Johannes Weiner wrote:
> On Fri, Aug 09, 2019 at 02:43:24PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> > > On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > > > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> > [...]
> > > > > As Nick mentioned in the description, without mark_page_accessed in
> > > > > zapping part, repeated mmap + touch + munmap never activated the page
> > > > > while several read(2) calls easily promote it.
> > > > 
> > > > And is this really a problem? If we refault the same page then the
> > > > refaults detection should catch it no? In other words is the above still
> > > > a problem these days?
> > > 
> > > I admit we have been not fair for them because read(2) syscall pages are
> > > easily promoted regardless of zap timing unlike mmap-based pages.
> > > 
> > > However, if we remove the mark_page_accessed in the zap_pte_range, it
> > > would make them more unfair in that read(2)-accessed pages are easily
> > > promoted while mmap-based page should go through refault to be promoted.
> > 
> > I have really hard time to follow why an unmap special handling is
> > making the overall state more reasonable.
> > 
> > Anyway, let me throw the patch for further discussion. Nick, Mel,
> > Johannes what do you think?
> > 
> > From 3821c2e66347a2141358cabdc6224d9990276fec Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Fri, 9 Aug 2019 14:29:59 +0200
> > Subject: [PATCH] mm: drop mark_page_accessed from the unmap path
> > 
> > Minchan has noticed that mark_page_accessed can take quite some time
> > during unmap:
> > : I had a time to benchmark it via adding some trace_printk hooks between
> > : pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> > : device is 2018 premium mobile device.
> > :
> > : I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> > : task runs on little core even though it doesn't have any IPI and LRU
> > : lock contention. It's already too heavy.
> > :
> > : If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> > : so most of overhead(about 0.7ms) comes from activate_page via
> > : mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could
> > : accumulate up to several ms.
> > 
> > bf3f3bc5e734 ("mm: don't mark_page_accessed in fault path") has replaced
> > SetPageReferenced by mark_page_accessed arguing that the former is not
> > sufficient when mark_page_accessed is removed from the fault path
> > because it doesn't promote page to the active list. It is true that a
> > page that is mapped by a single process might not get promoted even when
> > referenced if the reclaim checks it after the unmap but does that matter
> > that much? Can we consider the page hot if there are no other
> > users? Moreover we do have workingset detection in place since then and
> > so a next refault would activate the page if it was really hot one.
> 
> I do think the pages can be very hot. Think of short-lived executables
> and their libraries. Like shell commands. When they run a few times or
> periodically, they should be promoted to the active list and not have
> to compete with streaming IO on the inactive list - the PG_referenced
> doesn't really help them there, see page_check_references().

Yeah, I am aware of that. We do rely on more processes mapping the page,
which is what I've tried to explain in the changelog.

Btw. can we promote PageReferenced pages with zero mapcount? I am
throwing that out more as an idea because I haven't really thought it
through yet.

> Maybe the refaults will be fine - but latency expectations around
> mapped page cache certainly are a lot higher than unmapped cache.
>
> So I'm a bit reluctant about this patch. If Minchan can be happy with
> the lock batching, I'd prefer that.

Yes, it seems that the regular lock drop&relock helps in Minchan's case,
but this is the kind of change that might have other subtle side effects.
E.g. will-it-scale has noticed a regression [1], likely because the
critical section is shorter and the overall throughput of the operation
decreases. Now, w-i-s is an artificial benchmark so I wouldn't lose
much sleep over it normally, but we have already seen real regressions
when the locking pattern has changed in the past, so I would be a bit
cautious.

As I've said, this RFC is mostly to open a discussion. I would really
like to weigh the overhead of mark_page_accessed against the potential
scenarios where refaults would be visible in practice. I can imagine that
short-lived, statically linked applications have a higher chance of being
the only user, unlike libraries, which are often mapped via several ptes.
But the main problem in evaluating this is that there are many other
external factors that trigger the worst case.

[1] http://lkml.kernel.org/r/20190806070547.GA10123@xsang-OptiPlex-9020
Johannes Weiner Aug. 12, 2019, 3:07 p.m. UTC | #4
On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
> On Fri 09-08-19 14:34:24, Johannes Weiner wrote:
> > On Fri, Aug 09, 2019 at 02:43:24PM +0200, Michal Hocko wrote:
> > > On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> > > > On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > > > > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> > > [...]
> > > > > > As Nick mentioned in the description, without mark_page_accessed in
> > > > > > zapping part, repeated mmap + touch + munmap never activated the page
> > > > > > while several read(2) calls easily promote it.
> > > > > 
> > > > > And is this really a problem? If we refault the same page then the
> > > > > refaults detection should catch it no? In other words is the above still
> > > > > a problem these days?
> > > > 
> > > > I admit we have been not fair for them because read(2) syscall pages are
> > > > easily promoted regardless of zap timing unlike mmap-based pages.
> > > > 
> > > > However, if we remove the mark_page_accessed in the zap_pte_range, it
> > > > would make them more unfair in that read(2)-accessed pages are easily
> > > > promoted while mmap-based page should go through refault to be promoted.
> > > 
> > > I have really hard time to follow why an unmap special handling is
> > > making the overall state more reasonable.
> > > 
> > > Anyway, let me throw the patch for further discussion. Nick, Mel,
> > > Johannes what do you think?
> > > 
> > > From 3821c2e66347a2141358cabdc6224d9990276fec Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.com>
> > > Date: Fri, 9 Aug 2019 14:29:59 +0200
> > > Subject: [PATCH] mm: drop mark_page_accessed from the unmap path
> > > 
> > > Minchan has noticed that mark_page_accessed can take quite some time
> > > during unmap:
> > > : I had a time to benchmark it via adding some trace_printk hooks between
> > > : pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> > > : device is 2018 premium mobile device.
> > > :
> > > : I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> > > : task runs on little core even though it doesn't have any IPI and LRU
> > > : lock contention. It's already too heavy.
> > > :
> > > : If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> > > : so most of overhead(about 0.7ms) comes from activate_page via
> > > : mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could
> > > : accumulate up to several ms.
> > > 
> > > bf3f3bc5e734 ("mm: don't mark_page_accessed in fault path") has replaced
> > > SetPageReferenced by mark_page_accessed arguing that the former is not
> > > sufficient when mark_page_accessed is removed from the fault path
> > > because it doesn't promote page to the active list. It is true that a
> > > page that is mapped by a single process might not get promoted even when
> > > referenced if the reclaim checks it after the unmap but does that matter
> > > that much? Can we consider the page hot if there are no other
> > > users? Moreover we do have workingset detection in place since then and
> > > so a next refault would activate the page if it was really hot one.
> > 
> > I do think the pages can be very hot. Think of short-lived executables
> > and their libraries. Like shell commands. When they run a few times or
> > periodically, they should be promoted to the active list and not have
> > to compete with streaming IO on the inactive list - the PG_referenced
> > doesn't really help them there, see page_check_references().
> 
> Yeah, I am aware of that. We do rely on more processes to map the page
> which I've tried to explain in the changelog.
> 
> Btw. can we promote PageReferenced pages with zero mapcount? I am
> throwing that more as an idea because I haven't really thought that
> through yet.

That flag implements a second-chance policy, see this commit:

commit 645747462435d84c6c6a64269ed49cc3015f753d
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Fri Mar 5 13:42:22 2010 -0800

    vmscan: detect mapped file pages used only once

We had an application that would checksum large files using mmapped IO
to avoid double buffering. The VM used to activate mapped cache
directly, and it thrashed the actual workingset.

In response I added support for use-once mapped pages using this flag:
SetPageReferenced signals the VM that we're not sure about the page
yet and gives it another round trip on the LRU.

If you activate on this flag, it would restore the initial problem of
use-once pages thrashing the workingset.

> > Maybe the refaults will be fine - but latency expectations around
> > mapped page cache certainly are a lot higher than unmapped cache.
> >
> > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > the lock batching, I'd prefer that.
> 
> Yes, it seems that the regular lock drop&relock helps in Minchan's case
> but this is a kind of change that might have other subtle side effects.
> E.g. will-it-scale has noticed a regression [1], likely because the
> critical section is shorter and the overall throughput of the operation
> decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> much sleep over it normally but we have already seen real regressions
> when the locking pattern has changed in the past so I would be a bit
> cautious.

I'm much more concerned about fundamentally changing the aging policy
of mapped page cache than about the lock breaking scheme. With locking
we worry about CPU effects; with aging we worry about additional IO.

> As I've said, this RFC is mostly to open a discussion. I would really
> like to weigh the overhead of mark_page_accessed and potential scenario
> when refaults would be visible in practice. I can imagine that a short
> lived statically linked applications have higher chance of being the
> only user unlike libraries which are often being mapped via several
> ptes. But the main problem to evaluate this is that there are many other
> external factors to trigger the worst case.

We can discuss the pros and cons, but ultimately we simply need to
test it against real workloads to see if changing the promotion rules
regresses the amount of paging we do in practice.
Michal Hocko Aug. 13, 2019, 10:51 a.m. UTC | #5
On Mon 12-08-19 11:07:25, Johannes Weiner wrote:
> On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
[...]
> > Btw. can we promote PageReferenced pages with zero mapcount? I am
> > throwing that more as an idea because I haven't really thought that
> > through yet.
> 
> That flag implements a second-chance policy, see this commit:
> 
> commit 645747462435d84c6c6a64269ed49cc3015f753d
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Fri Mar 5 13:42:22 2010 -0800
> 
>     vmscan: detect mapped file pages used only once
> 
> We had an application that would checksum large files using mmapped IO
> to avoid double buffering. The VM used to activate mapped cache
> directly, and it thrashed the actual workingset.
> 
> In response I added support for use-once mapped pages using this flag.
> SetPageReferenced signals the VM that we're not sure about the page
> yet and gives it another round trip on the LRU.
> 
> If you activate on this flag, it would restore the initial problem of
> use-once pages thrashing the workingset.

You are right of course. I should have realized that! We really need
another piece of information, stored in the struct page or maybe the
xarray, to reflect that.
 
> > > Maybe the refaults will be fine - but latency expectations around
> > > mapped page cache certainly are a lot higher than unmapped cache.
> > >
> > > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > > the lock batching, I'd prefer that.
> > 
> > Yes, it seems that the regular lock drop&relock helps in Minchan's case
> > but this is a kind of change that might have other subtle side effects.
> > E.g. will-it-scale has noticed a regression [1], likely because the
> > critical section is shorter and the overall throughput of the operation
> > decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> > much sleep over it normally but we have already seen real regressions
> > when the locking pattern has changed in the past so I would be a bit
> > cautious.
> 
> I'm much more concerned about fundamentally changing the aging policy
> > of mapped page cache than about the lock breaking scheme. With locking
> we worry about CPU effects; with aging we worry about additional IO.

But the latter is observable and debuggable a little bit more easily IMHO.
From my experience, people are quite used to watching for major faults,
as that is an easy metric to compare.
 
> > As I've said, this RFC is mostly to open a discussion. I would really
> > like to weigh the overhead of mark_page_accessed and potential scenario
> > when refaults would be visible in practice. I can imagine that a short
> > lived statically linked applications have higher chance of being the
> > only user unlike libraries which are often being mapped via several
> > ptes. But the main problem to evaluate this is that there are many other
> > external factors to trigger the worst case.
> 
> We can discuss the pros and cons, but ultimately we simply need to
> test it against real workloads to see if changing the promotion rules
> regresses the amount of paging we do in practice.

Agreed. Do you see any option other than trying it out and reverting if
we see regressions? We would get a workload description, which would be
helpful for future regression testing when touching this area. We can
start slower and keep it in linux-next for a release cycle to catch any
fallout early.

Thoughts?
Michal Hocko Aug. 26, 2019, 12:06 p.m. UTC | #6
On Tue 13-08-19 12:51:43, Michal Hocko wrote:
> On Mon 12-08-19 11:07:25, Johannes Weiner wrote:
> > On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
[...]
> > > > Maybe the refaults will be fine - but latency expectations around
> > > > mapped page cache certainly are a lot higher than unmapped cache.
> > > >
> > > > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > > > the lock batching, I'd prefer that.
> > > 
> > > Yes, it seems that the regular lock drop&relock helps in Minchan's case
> > > but this is a kind of change that might have other subtle side effects.
> > > E.g. will-it-scale has noticed a regression [1], likely because the
> > > critical section is shorter and the overall throughput of the operation
> > > decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> > > much sleep over it normally but we have already seen real regressions
> > > when the locking pattern has changed in the past so I would be a bit
> > > cautious.
> > 
> > I'm much more concerned about fundamentally changing the aging policy
> > > of mapped page cache than about the lock breaking scheme. With locking
> > we worry about CPU effects; with aging we worry about additional IO.
> 
> But the latter is observable and debuggable a little bit more easily IMHO.
> People are quite used to watch for major faults from my experience
> as that is an easy metric to compare.
>  
> > > As I've said, this RFC is mostly to open a discussion. I would really
> > > like to weigh the overhead of mark_page_accessed and potential scenario
> > > when refaults would be visible in practice. I can imagine that a short
> > > lived statically linked applications have higher chance of being the
> > > only user unlike libraries which are often being mapped via several
> > > ptes. But the main problem to evaluate this is that there are many other
> > > external factors to trigger the worst case.
> > 
> > We can discuss the pros and cons, but ultimately we simply need to
> > test it against real workloads to see if changing the promotion rules
> > regresses the amount of paging we do in practice.
> 
> Agreed. Do you see other option than to try it out and revert if we see
> regressions? We would get a workload description which would be helpful
> for future regression testing when touching this area. We can start
> slower and keep it in linux-next for a release cycle to catch any
> fallouts early.
> 
> Thoughts?

ping...
Johannes Weiner Aug. 27, 2019, 4 p.m. UTC | #7
On Mon, Aug 26, 2019 at 02:06:30PM +0200, Michal Hocko wrote:
> On Tue 13-08-19 12:51:43, Michal Hocko wrote:
> > On Mon 12-08-19 11:07:25, Johannes Weiner wrote:
> > > On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
> [...]
> > > > > Maybe the refaults will be fine - but latency expectations around
> > > > > mapped page cache certainly are a lot higher than unmapped cache.
> > > > >
> > > > > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > > > > the lock batching, I'd prefer that.
> > > > 
> > > > Yes, it seems that the regular lock drop&relock helps in Minchan's case
> > > > but this is a kind of change that might have other subtle side effects.
> > > > E.g. will-it-scale has noticed a regression [1], likely because the
> > > > critical section is shorter and the overall throughput of the operation
> > > > decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> > > > much sleep over it normally but we have already seen real regressions
> > > > when the locking pattern has changed in the past so I would be a bit
> > > > cautious.
> > > 
> > > I'm much more concerned about fundamentally changing the aging policy
> > > > of mapped page cache than about the lock breaking scheme. With locking
> > > we worry about CPU effects; with aging we worry about additional IO.
> > 
> > But the latter is observable and debuggable a little bit more easily IMHO.
> > People are quite used to watch for major faults from my experience
> > as that is an easy metric to compare.

Root-causing additional (re)faults is really difficult. We're talking
about a slight trend change in caching behavior in a sea of millions
of pages. There could be so many factors causing this, and for most of
them you have to patch debugging stuff into the kernel to rule them out.

A CPU regression you can figure out with perf.

> > > > As I've said, this RFC is mostly to open a discussion. I would really
> > > > like to weigh the overhead of mark_page_accessed and potential scenario
> > > > when refaults would be visible in practice. I can imagine that a short
> > > > lived statically linked applications have higher chance of being the
> > > > only user unlike libraries which are often being mapped via several
> > > > ptes. But the main problem to evaluate this is that there are many other
> > > > external factors to trigger the worst case.
> > > 
> > > We can discuss the pros and cons, but ultimately we simply need to
> > > test it against real workloads to see if changing the promotion rules
> > > regresses the amount of paging we do in practice.
> > 
> > Agreed. Do you see other option than to try it out and revert if we see
> > regressions? We would get a workload description which would be helpful
> > for future regression testing when touching this area. We can start
> > slower and keep it in linux-next for a release cycle to catch any
> > fallouts early.
> > 
> > Thoughts?
> 
> ping...

Personally, I'm not convinced by this patch. I think it's a pretty
drastic change in aging heuristics just to address a CPU overhead
problem that has simpler, easier-to-verify alternative solutions.

It WOULD be great to clarify and improve the aging model for mapped
cache, to make it a bit easier to reason about. But this patch does
not really get there either. Instead of taking a serious look at
mapped cache lifetime and usage scenarios, the changelog is more in
"let's see what breaks if we take out this screw here" territory.

So I'm afraid I don't think the patch & changelog in its current shape
should go upstream.
Michal Hocko Aug. 27, 2019, 6:41 p.m. UTC | #8
On Tue 27-08-19 12:00:26, Johannes Weiner wrote:
> On Mon, Aug 26, 2019 at 02:06:30PM +0200, Michal Hocko wrote:
> > On Tue 13-08-19 12:51:43, Michal Hocko wrote:
> > > On Mon 12-08-19 11:07:25, Johannes Weiner wrote:
> > > > On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
> > [...]
> > > > > > Maybe the refaults will be fine - but latency expectations around
> > > > > > mapped page cache certainly are a lot higher than unmapped cache.
> > > > > >
> > > > > > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > > > > > the lock batching, I'd prefer that.
> > > > > 
> > > > > Yes, it seems that the regular lock drop&relock helps in Minchan's case
> > > > > but this is a kind of change that might have other subtle side effects.
> > > > > E.g. will-it-scale has noticed a regression [1], likely because the
> > > > > critical section is shorter and the overall throughput of the operation
> > > > > decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> > > > > much sleep over it normally but we have already seen real regressions
> > > > > when the locking pattern has changed in the past so I would be a bit
> > > > > cautious.
> > > > 
> > > > I'm much more concerned about fundamentally changing the aging policy
> > > > of mapped page cache than about the lock breaking scheme. With locking
> > > > we worry about CPU effects; with aging we worry about additional IO.
> > > 
> > > But the latter is observable and debuggable a little bit more easily IMHO.
> > > People are quite used to watch for major faults from my experience
> > > as that is an easy metric to compare.
> 
> Root-causing additional (re)faults is really difficult. We're talking
> about a slight trend change in caching behavior in a sea of millions
> of pages. There could be so many factors causing this, and for most
> you have to patch debugging stuff into the kernel to rule them out.
> 
> A CPU regression you can figure out with perf.
> 
> > > > > As I've said, this RFC is mostly to open a discussion. I would really
> > > > > like to weigh the overhead of mark_page_accessed against the potential
> > > > > scenarios where refaults would be visible in practice. I can imagine
> > > > > that short-lived, statically linked applications have a higher chance
> > > > > of being the only user, unlike libraries which are often mapped via
> > > > > several ptes. But the main problem in evaluating this is that there are
> > > > > many other external factors triggering the worst case.
> > > > 
> > > > We can discuss the pros and cons, but ultimately we simply need to
> > > > test it against real workloads to see if changing the promotion rules
> > > > regresses the amount of paging we do in practice.
> > > 
> > > Agreed. Do you see any other option than to try it out and revert if we see
> > > regressions? We would get a workload description which would be helpful
> > > for future regression testing when touching this area. We can start
> > > slower and keep it in linux-next for a release cycle to catch any
> > > fallouts early.
> > > 
> > > Thoughts?
> > 
> > ping...
> 
> Personally, I'm not convinced by this patch. I think it's a pretty
> drastic change in aging heuristics just to address a CPU overhead
> problem that has simpler, easier-to-verify alternative solutions.
> 
> It WOULD be great to clarify and improve the aging model for mapped
> cache, to make it a bit easier to reason about.

I fully agree with this! Do you have any specific ideas? I am afraid I
am unlikely to find time for a larger project, which this sounds to be,
but maybe others will find it a good fit.

> But this patch does
> not really get there either. Instead of taking a serious look at
> mapped cache lifetime and usage scenarios, the changelog is more in
> "let's see what breaks if we take out this screw here" territory.

You know that I tend to be quite conservative. In this case I can see
the cost, which is not negligible and likely to hit many workloads
because this is a common path. The immediate benefit, though, is not
really clear, at least to me. We can speculate, and I would really love
to hear from Nick what exactly led him to this change.
 
> So I'm afraid I don't think the patch & changelog in its current shape
> should go upstream.

I will not insist, of course, but it would be really great to know and
_document_ why we are doing this. I really hate how often we keep
different heuristics and build more complex solutions on top just
because nobody dares to change them.

Our code is really hard to reason about.

Patch
diff mbox series

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..ced521df8ee7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1053,7 +1053,7 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				}
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
-					mark_page_accessed(page);
+					SetPageReferenced(page);
 			}
 			rss[mm_counter(page)]--;
 			page_remove_rmap(page, false);