All of lore.kernel.org
* Unhelpful caching decisions, possibly related to active/inactive sizing
@ 2016-02-09 16:52 ` Andres Freund
From: Andres Freund @ 2016-02-09 16:52 UTC
  To: Johannes Weiner, Rik van Riel; +Cc: linux-mm, linux-kernel, Vlastimil Babka

Hi,

I'm working on fixing long IO stalls with postgres. After some
architectural changes fixing the worst issues, I noticed that individual
processes/backends/connections still spend more time waiting than I'd
expect.

In a workload with the hot data set fitting into memory (2GB of
mmap(HUGE|ANON) shared memory for postgres buffer cache, ~6GB of
dataset, 16GB total memory) I found that there's more reads hitting disk
than I'd expect.  That's after I led Vlastimil on IRC down a wrong
rabbit hole, sorry for that.

Some tinkering and questions later, the issue appears to be postgres'
journal/WAL, which in the test setup is write-only, and only touched
again when individual WAL segments are reused. In the configuration
I'm using, that only happens after ~20min and ~30GB of WAL.
Drastically reducing the volume of WAL through some (unsafe)
configuration options, or forcing the WAL to be written using O_DIRECT,
makes the workload fully cached.

Rik asked me about active/inactive sizing in /proc/meminfo:
Active:          7860556 kB
Inactive:        5395644 kB
Active(anon):    2874936 kB
Inactive(anon):   432308 kB
Active(file):    4985620 kB
Inactive(file):  4963336 kB

and then said:

riel   | the workingset stuff does not appear to be taken into account for active/inactive list sizing, in vmscan.c
riel   | I suspect we will want to expand the vmscan.c code, to take the workingset stats into account
riel   | when we re-fault a page that was on the active list before, we want to grow the size of the active list (and
       | shrink from inactive)
riel   | when we re-fault a page that was never active, we need to grow the size of the inactive list (and shrink
       | active)
riel   | but I don't think we have any bits free in page flags for that, we may need to improvise something :)

andres | Ok, at this point I'm kinda out of my depth here ;)

riel   | andres: basically active & inactive file LRUs are kept at the same size currently
riel   | andres: which means anything that overflows half of memory will get flushed out of the cache by large write
       | volumes (to the write-only log)
riel   | andres: what we should do is dynamically size the active & inactive file lists, depending on which of the two
       | needs more caching
riel   | andres: if we never re-use the inactive pages that get flushed out, there's no sense in caching more of them
       | (and we could dedicate more memory to the active list, instead)

andres | Sounds sensible. I guess things get really tricky if there's a portion of the inactive list that does get
       | reused (say if the hot data set is larger than memory), and another doesn't get reused at all.

I promised to send an email about the issue...

I can provide you with a branch of postgres plus instructions to
reproduce the issue, or I can test patches, whatever you prefer.

This test was run using 4.5.0-rc2, but I doubt this is a recent
regression.

Any other information I can provide you with?

Regards,

Andres



* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-09 16:52 ` Andres Freund
@ 2016-02-09 22:42   ` Johannes Weiner
From: Johannes Weiner @ 2016-02-09 22:42 UTC
  To: Andres Freund; +Cc: Rik van Riel, linux-mm, linux-kernel, Vlastimil Babka

Hi,

On Tue, Feb 09, 2016 at 05:52:40PM +0100, Andres Freund wrote:
> Hi,
> 
> I'm working on fixing long IO stalls with postgres. After some
> architectural changes fixing the worst issues, I noticed that individual
> processes/backends/connections still spend more time waiting than I'd
> expect.
> 
> In a workload with the hot data set fitting into memory (2GB of
> mmap(HUGE|ANON) shared memory for postgres buffer cache, ~6GB of
> dataset, 16GB total memory) I found that there's more reads hitting disk
> than I'd expect.  That's after I led Vlastimil on IRC down a wrong
> rabbit hole, sorry for that.
> 
> Some tinkering and questions later, the issue appears to be postgres'
> journal/WAL, which in the test setup is write-only, and only touched
> again when individual WAL segments are reused. In the configuration
> I'm using, that only happens after ~20min and ~30GB of WAL.
> Drastically reducing the volume of WAL through some (unsafe)
> configuration options, or forcing the WAL to be written using O_DIRECT,
> makes the workload fully cached.
> 
> Rik asked me about active/inactive sizing in /proc/meminfo:
> Active:          7860556 kB
> Inactive:        5395644 kB
> Active(anon):    2874936 kB
> Inactive(anon):   432308 kB
> Active(file):    4985620 kB
> Inactive(file):  4963336 kB
> 
> and then said:
> 
> riel   | the workingset stuff does not appear to be taken into account for active/inactive list sizing, in vmscan.c
> riel   | I suspect we will want to expand the vmscan.c code, to take the workingset stats into account
> riel   | when we re-fault a page that was on the active list before, we want to grow the size of the active list (and
>        | shrink from inactive)
> riel   | when we re-fault a page that was never active, we need to grow the size of the inactive list (and shrink
>        | active)
> riel   | but I don't think we have any bits free in page flags for that, we may need to improvise something :)
>
> andres | Ok, at this point I'm kinda out of my depth here ;)
> 
> riel   | andres: basically active & inactive file LRUs are kept at the same size currently
> riel   | andres: which means anything that overflows half of memory will get flushed out of the cache by large write
>        | volumes (to the write-only log)
> riel   | andres: what we should do is dynamically size the active & inactive file lists, depending on which of the two
>        | needs more caching
> riel   | andres: if we never re-use the inactive pages that get flushed out, there's no sense in caching more of them
>        | (and we could dedicate more memory to the active list, instead)

Yes, a generous minimum size of the inactive list made sense when it
was the exclusive staging area to tell use-once pages from use-many
pages. Now that we have refault information to detect use-many with
arbitrary inactive list size, this minimum is no longer reasonable.

The new minimum should be smaller, but big enough for applications to
actually use the data in their pages between fault and eviction
(i.e. it needs to take the aggregate readahead window into account),
and big enough for active pages that are speculatively challenged
during workingset changes to get re-activated without incurring IO.

However, I don't think it makes sense to dynamically adjust the
balance between the active and the inactive cache during refaults.

I assume your thinking here is that when never-active pages are
refaulting during workingset transitions, it's an indication that the
inactive cache needs more slots to detect use-many without incurring
IO, and hence that we should give it some slots from the active cache.

However, deactivation doesn't give the inactive cache more slots to
use, it just reassigns already occupied cache slots. The only way to
actually increase the number of available inactive cache slots upon
refault would be to reclaim active cache slots.

And that is something we can't do, because we don't know how hot the
incumbent active pages actually are. They could be hotter than the
challenging refault page, they could be colder. So what we are doing
now is putting them next to each other - currently by activating the
refault page, but we could also deactivate the incumbent - and let the
aging machinery pick a winner.

[ We *could* do active list reclaim, but it would cause IO in the case
  where the incumbent workingset is challenged but not defeated.

  It's a trade-off. We just decide how strongly we want to protect the
  incumbent under challenge. ]



* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-09 22:42   ` Johannes Weiner
@ 2016-02-11 20:34     ` Rik van Riel
From: Rik van Riel @ 2016-02-11 20:34 UTC
  To: Johannes Weiner; +Cc: Andres Freund, linux-mm, linux-kernel, Vlastimil Babka

On Tue, 9 Feb 2016 17:42:56 -0500
Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Feb 09, 2016 at 05:52:40PM +0100, Andres Freund wrote:

> > Rik asked me about active/inactive sizing in /proc/meminfo:
> > Active:          7860556 kB
> > Inactive:        5395644 kB
> > Active(anon):    2874936 kB
> > Inactive(anon):   432308 kB
> > Active(file):    4985620 kB
> > Inactive(file):  4963336 kB

> Yes, a generous minimum size of the inactive list made sense when it
> was the exclusive staging area to tell use-once pages from use-many
> pages. Now that we have refault information to detect use-many with
> arbitrary inactive list size, this minimum is no longer reasonable.
> 
> The new minimum should be smaller, but big enough for applications to
> actually use the data in their pages between fault and eviction
> (i.e. it needs to take the aggregate readahead window into account),
> and big enough for active pages that are speculatively challenged
> during workingset changes to get re-activated without incurring IO.
> 
> However, I don't think it makes sense to dynamically adjust the
> balance between the active and the inactive cache during refaults.

Johannes, does this patch look ok to you?

Andres, does this patch work for you?

-----8<-----
Subject: mm,vmscan: reduce size of inactive file list

The inactive file list should still be large enough to contain
readahead windows and freshly written file data, but it no
longer is the only source for detecting multiple accesses to
file pages. The workingset refault measurement code promotes
recently evicted file pages that get accessed again within a
short enough interval directly to the active list.

With that mechanism in place, we can afford to (on a larger
system) dedicate more memory to the active file list, so we
can actually cache more of the frequently used file pages
in memory, and not have them pushed out by streaming writes,
once-used streaming file reads, etc.

This can help things like database workloads, where currently only
half the page cache can be used to cache the database working set.
This patch automatically increases that fraction on larger systems,
using the same ratio that is already used for anonymous memory.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Andres Freund <andres@anarazel.de>
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb3dd37ccd7c..0a316c41bf80 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1928,13 +1928,14 @@ static inline bool inactive_anon_is_low(struct lruvec *lruvec)
  */
 static bool inactive_file_is_low(struct lruvec *lruvec)
 {
+	struct zone *zone = lruvec_zone(lruvec);
 	unsigned long inactive;
 	unsigned long active;
 
 	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
 
-	return active > inactive;
+	return inactive * zone->inactive_ratio < active;
 }
 
 static bool inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)



* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-11 20:34     ` Rik van Riel
@ 2016-02-12 12:46       ` Andres Freund
From: Andres Freund @ 2016-02-12 12:46 UTC
  To: Rik van Riel; +Cc: Johannes Weiner, linux-mm, linux-kernel, Vlastimil Babka

Hi,

On 2016-02-11 15:34:04 -0500, Rik van Riel wrote:
> Andres, does this patch work for you?

TL;DR: the patch helps; there may be some additional, largely
independent, further improvements.


So, I tested this, and under the right set of circumstances the results
are pretty good:

----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
before (4.5.0-rc3-andres-00010-g765bdb4):
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 550
query mode: prepared
number of clients: 32
number of threads: 32
duration: 300 s
number of transactions actually processed: 3539535
latency average = 2.710 ms
latency stddev = 7.738 ms
tps = 11797.755387 (including connections establishing)
tps = 11798.515737 (excluding connections establishing)

Active:          6890844 kB
Inactive:        6409868 kB
Active(anon):     684804 kB
Inactive(anon):   202960 kB
Active(file):    6206040 kB
Inactive(file):  6206908 kB


~20MB reads/s

----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----

after (4.5.0-rc3-andres-00011-g2b9abf6-dirty):
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 550
query mode: prepared
number of clients: 32
number of threads: 32
duration: 300 s
number of transactions actually processed: 4372722
latency average = 2.192 ms
latency stddev = 6.095 ms
tps = 14575.395438 (including connections establishing)
tps = 14576.212433 (excluding connections establishing)


Active:          9460392 kB
Inactive:        3813748 kB
Active(anon):     444724 kB
Inactive(anon):   329368 kB
Active(file):    9015668 kB
Inactive(file):  3484380 kB

0MB reads/s

----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----

And that's on a system with a relatively fast SSD. I suspect the
difference would be massively bigger on systems with rotational media,
given the reads in the "before" state are almost entirely random.


Unfortunately the above "right set of circumstances" isn't entirely
trivial to arrange. Postgres writes its journal/WAL to pre-allocated WAL
files, which are then recycled after a while. The journal writes are
currently, intentionally, not marked with DONTNEED or O_DIRECT, because
internal and external processes (e.g. for replication) may read the
WAL again.

Without additional changes in postgres, recycling often appears to
lead to the WAL files being promoted onto the active list, even though
they are only ever written in a streaming manner, never read. That's
easy enough to defeat by doing a posix_fadvise(DONTNEED) during
recycling; after all, there's never a need to read the previous contents
before that point.  But if the window until recycling is large enough,
it often appears to prevent other inactive pages from being promoted to
active. That's probably hard to fix, at least in part.

I'm wondering why pages that are repeatedly written to, in units above
the page size, are promoted to the active list. If there are never any
reads, and no re-dirtying of an already-dirty page, what's the benefit
of moving that page onto the active list?

I imagine that high-volume streaming writes are generally pretty common
(non-durability log files!), and streaming overwrites also sound like
something a number of applications do.

Greetings,

Andres Freund



* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-11 20:34     ` Rik van Riel
@ 2016-02-12 12:56       ` Andres Freund
From: Andres Freund @ 2016-02-12 12:56 UTC
  To: Rik van Riel; +Cc: Johannes Weiner, linux-mm, linux-kernel, Vlastimil Babka

Hi,

On 2016-02-11 15:34:04 -0500, Rik van Riel wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eb3dd37ccd7c..0a316c41bf80 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1928,13 +1928,14 @@ static inline bool inactive_anon_is_low(struct lruvec *lruvec)
>   */
>  static bool inactive_file_is_low(struct lruvec *lruvec)
>  {
> +	struct zone *zone = lruvec_zone(lruvec);
>  	unsigned long inactive;
>  	unsigned long active;
>  
>  	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
>  	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
>  
> -	return active > inactive;
> +	return inactive * zone->inactive_ratio < active;
>  }

Oh, it looks to me like part of inactive_file_is_low()'s description:
 *
 * When the system is doing streaming IO, memory pressure here
 * ensures that active file pages get deactivated, until more
 * than half of the file pages are on the inactive list.
 *
would need updating with this patch.

Regards,

Andres Freund

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-12 12:46       ` Andres Freund
@ 2016-02-12 19:35         ` Andres Freund
  -1 siblings, 0 replies; 22+ messages in thread
From: Andres Freund @ 2016-02-12 19:35 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Johannes Weiner, linux-mm, linux-kernel, Vlastimil Babka

On 2016-02-12 13:46:53 +0100, Andres Freund wrote:
> I'm wondering why pages that are repeatedly written to, in units above
> the page size, are promoted to the active list? I mean if there never
> are any reads or re-dirtying an already-dirty page, what's the benefit
> of moving that page onto the active list?

We chatted about this on IRC and you proposed testing this by removing
FGP_ACCESSED in grab_cache_page_write_begin.  I ran tests with that,
after removing the aforementioned code to issue posix_fadvise(DONTNEED)
in postgres.

base (4.5-rc2+10)
        latency average = 3.079 ms
        latency stddev = 8.269 ms
        tps = 10384.545914 (including connections establishing)
        tps = 10384.866341 (excluding connections establishing)


inactive/active patch:
        latency average = 2.931 ms
        latency stddev = 7.683 ms
        tps = 10908.905039 (including connections establishing)
        tps = 10909.256946 (excluding connections establishing)


inactive/active patch + no FGP_ACCESSED in grab_cache_page_write_begin:
        latency average = 2.806 ms
        latency stddev = 7.871 ms
        tps = 11392.893213 (including connections establishing)
        tps = 11393.839826 (excluding connections establishing)


Here the active/inactive lists didn't change as much as I hoped. A bit
of reading made it apparent that the workingset logic in
add_to_page_cache_lru() defeated that attempt by moving a previously
evicted page directly onto the active list. I added a variant of
add_to_page_cache_lru() that accepts fgp_flags and only does the
workingset check if FGP_ACCESSED is set. That results in:

inactive/active patch + no FGP_ACCESSED in grab_cache_page_write_begin + add_to_page_cache_lru:
        latency average: 2.292 ms
        latency stddev: 6.487 ms
        tps = 13940.530898 (including connections establishing)
        tps = 13941.774874 (excluding connections establishing)

that's only slightly worse than doing explicit posix_fadvise(DONTNEED)
calls... Pretty good.

To make an actually usable patch out of this, it seems we'd have to add
a 'partial' argument to grab_cache_page_write_begin(), so writes to
parts of a page still cause those pages to be marked active.  Is it
preferable to change all callers of grab_cache_page_write_begin() and
add_to_page_cache_lru(), or to turn them into wrapper functions and
call the real deal only when it matters?

I do think that's a reasonable algorithmic change, but it's obviously
possible that such changes regress some workloads. What's the policy
around testing such things?

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-11 20:34     ` Rik van Riel
@ 2016-02-12 20:24       ` Johannes Weiner
  -1 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2016-02-12 20:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andres Freund, linux-mm, linux-kernel, Vlastimil Babka

On Thu, Feb 11, 2016 at 03:34:04PM -0500, Rik van Riel wrote:
> On Tue, 9 Feb 2016 17:42:56 -0500
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Feb 09, 2016 at 05:52:40PM +0100, Andres Freund wrote:
> 
> > > Rik asked me about active/inactive sizing in /proc/meminfo:
> > > Active:          7860556 kB
> > > Inactive:        5395644 kB
> > > Active(anon):    2874936 kB
> > > Inactive(anon):   432308 kB
> > > Active(file):    4985620 kB
> > > Inactive(file):  4963336 kB
> 
> > Yes, a generous minimum size of the inactive list made sense when it
> > was the exclusive staging area to tell use-once pages from use-many
> > pages. Now that we have refault information to detect use-many with
> > arbitrary inactive list size, this minimum is no longer reasonable.
> > 
> > The new minimum should be smaller, but big enough for applications to
> > actually use the data in their pages between fault and eviction
> > (i.e. it needs to take the aggregate readahead window into account),
> > and big enough for active pages that are speculatively challenged
> > during workingset changes to get re-activated without incurring IO.
> > 
> > However, I don't think it makes sense to dynamically adjust the
> > balance between the active and the inactive cache during refaults.
> 
> Johannes, does this patch look ok to you?

Yes, the anon ratio we use looks like a good fit for file as well.

I've updated the patch to work with cgroups.

From 4d8ab2e69e16b7b59551e796283c046149698e9b Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@redhat.com>
Date: Fri, 12 Feb 2016 15:15:26 -0500
Subject: [PATCH] mm,vmscan: reduce size of inactive file list

The inactive file list should still be large enough to contain
readahead windows and freshly written file data, but it no
longer is the only source for detecting multiple accesses to
file pages. The workingset refault measurement code causes
recently evicted file pages that get accessed again after a
shorter interval to be promoted directly to the active list.

With that mechanism in place, we can afford to (on a larger
system) dedicate more memory to the active file list, so we
can actually cache more of the frequently used file pages
in memory, and not have them pushed out by streaming writes,
once-used streaming file reads, etc.

This can help things like database workloads, where only
half the page cache can currently be used to cache the
database working set. This patch automatically increases
that fraction on larger systems, using the same ratio that
has already been used for anonymous memory.

[hannes@cmpxchg.org: cgroup-awareness]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
---
 include/linux/memcontrol.h |  25 -----------
 mm/page_alloc.c            |  44 -------------------
 mm/vmscan.c                | 104 ++++++++++++++++++---------------------------
 3 files changed, 42 insertions(+), 131 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1191d79..3694f88 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -415,25 +415,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
 	return mz->lru_size[lru];
 }
 
-static inline bool mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
-{
-	unsigned long inactive_ratio;
-	unsigned long inactive;
-	unsigned long active;
-	unsigned long gb;
-
-	inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON);
-	active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON);
-
-	gb = (inactive + active) >> (30 - PAGE_SHIFT);
-	if (gb)
-		inactive_ratio = int_sqrt(10 * gb);
-	else
-		inactive_ratio = 1;
-
-	return inactive * inactive_ratio < active;
-}
-
 void mem_cgroup_handle_over_high(void);
 
 void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -646,12 +627,6 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
 	return true;
 }
 
-static inline bool
-mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
-{
-	return true;
-}
-
 static inline unsigned long
 mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c849a20..9269736e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6439,49 +6439,6 @@ void setup_per_zone_wmarks(void)
 }
 
 /*
- * The inactive anon list should be small enough that the VM never has to
- * do too much work, but large enough that each inactive page has a chance
- * to be referenced again before it is swapped out.
- *
- * The inactive_anon ratio is the target ratio of ACTIVE_ANON to
- * INACTIVE_ANON pages on this zone's LRU, maintained by the
- * pageout code. A zone->inactive_ratio of 3 means 3:1 or 25% of
- * the anonymous pages are kept on the inactive list.
- *
- * total     target    max
- * memory    ratio     inactive anon
- * -------------------------------------
- *   10MB       1         5MB
- *  100MB       1        50MB
- *    1GB       3       250MB
- *   10GB      10       0.9GB
- *  100GB      31         3GB
- *    1TB     101        10GB
- *   10TB     320        32GB
- */
-static void __meminit calculate_zone_inactive_ratio(struct zone *zone)
-{
-	unsigned int gb, ratio;
-
-	/* Zone size in gigabytes */
-	gb = zone->managed_pages >> (30 - PAGE_SHIFT);
-	if (gb)
-		ratio = int_sqrt(10 * gb);
-	else
-		ratio = 1;
-
-	zone->inactive_ratio = ratio;
-}
-
-static void __meminit setup_per_zone_inactive_ratio(void)
-{
-	struct zone *zone;
-
-	for_each_zone(zone)
-		calculate_zone_inactive_ratio(zone);
-}
-
-/*
  * Initialise min_free_kbytes.
  *
  * For small machines we want it small (128k min).  For large machines
@@ -6526,7 +6483,6 @@ int __meminit init_per_zone_wmark_min(void)
 	setup_per_zone_wmarks();
 	refresh_zone_stat_thresholds();
 	setup_per_zone_lowmem_reserve();
-	setup_per_zone_inactive_ratio();
 	return 0;
 }
 module_init(init_per_zone_wmark_min)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86eb214..1585579 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1865,83 +1865,63 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	free_hot_cold_page_list(&l_hold, true);
 }
 
-#ifdef CONFIG_SWAP
-static bool inactive_anon_is_low_global(struct zone *zone)
-{
-	unsigned long active, inactive;
-
-	active = zone_page_state(zone, NR_ACTIVE_ANON);
-	inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-
-	return inactive * zone->inactive_ratio < active;
-}
-
-/**
- * inactive_anon_is_low - check if anonymous pages need to be deactivated
- * @lruvec: LRU vector to check
+/*
+ * The inactive anon list should be small enough that the VM never has
+ * to do too much work.
  *
- * Returns true if the zone does not have enough inactive anon pages,
- * meaning some active anon pages need to be deactivated.
- */
-static bool inactive_anon_is_low(struct lruvec *lruvec)
-{
-	/*
-	 * If we don't have swap space, anonymous page deactivation
-	 * is pointless.
-	 */
-	if (!total_swap_pages)
-		return false;
-
-	if (!mem_cgroup_disabled())
-		return mem_cgroup_inactive_anon_is_low(lruvec);
-
-	return inactive_anon_is_low_global(lruvec_zone(lruvec));
-}
-#else
-static inline bool inactive_anon_is_low(struct lruvec *lruvec)
-{
-	return false;
-}
-#endif
-
-/**
- * inactive_file_is_low - check if file pages need to be deactivated
- * @lruvec: LRU vector to check
+ * The inactive file list should be small enough to leave most memory
+ * to the established workingset on the scan-resistant active list,
+ * but large enough to avoid thrashing the aggregate readahead window.
  *
- * When the system is doing streaming IO, memory pressure here
- * ensures that active file pages get deactivated, until more
- * than half of the file pages are on the inactive list.
+ * Both inactive lists should also be large enough that each inactive
+ * page has a chance to be referenced again before it is reclaimed.
  *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
+ * The inactive_ratio is the target ratio of ACTIVE to INACTIVE pages
+ * on this LRU, maintained by the pageout code. A zone->inactive_ratio
+ * of 3 means 3:1 or 25% of the pages are kept on the inactive list.
  *
- * This uses a different ratio than the anonymous pages, because
- * the page cache uses a use-once replacement algorithm.
+ * total     target    max
+ * memory    ratio     inactive
+ * -------------------------------------
+ *   10MB       1         5MB
+ *  100MB       1        50MB
+ *    1GB       3       250MB
+ *   10GB      10       0.9GB
+ *  100GB      31         3GB
+ *    1TB     101        10GB
+ *   10TB     320        32GB
  */
-static bool inactive_file_is_low(struct lruvec *lruvec)
+static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
 {
+	unsigned long inactive_ratio;
 	unsigned long inactive;
 	unsigned long active;
+	unsigned long gb;
 
-	inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
-	active = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
+	/*
+	 * If we don't have swap space, anonymous page deactivation
+	 * is pointless.
+	 */
+	if (!file && !total_swap_pages)
+		return false;
 
-	return active > inactive;
-}
+	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
+	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
 
-static bool inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
-{
-	if (is_file_lru(lru))
-		return inactive_file_is_low(lruvec);
+	gb = (inactive + active) >> (30 - PAGE_SHIFT);
+	if (gb)
+		inactive_ratio = int_sqrt(10 * gb);
 	else
-		return inactive_anon_is_low(lruvec);
+		inactive_ratio = 1;
+
+	return inactive * inactive_ratio < active;
 }
 
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, lru))
+		if (inactive_list_is_low(lruvec, is_file_lru(lru)))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -2062,7 +2042,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * lruvec even if it has plenty of old anonymous pages unless the
 	 * system is under heavy pressure.
 	 */
-	if (!inactive_file_is_low(lruvec) &&
+	if (!inactive_list_is_low(lruvec, true) &&
 	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
@@ -2304,7 +2284,7 @@ static void shrink_zone_memcg(struct zone *zone, struct mem_cgroup *memcg,
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_anon_is_low(lruvec))
+	if (inactive_list_is_low(lruvec, false))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
 
@@ -2948,7 +2928,7 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	do {
 		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
-		if (inactive_anon_is_low(lruvec))
+		if (inactive_list_is_low(lruvec, false))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 					   sc, LRU_ACTIVE_ANON);
 
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-12 19:35         ` Andres Freund
@ 2016-02-16 19:29           ` Johannes Weiner
  -1 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2016-02-16 19:29 UTC (permalink / raw)
  To: Andres Freund; +Cc: Rik van Riel, linux-mm, linux-kernel, Vlastimil Babka

On Fri, Feb 12, 2016 at 08:35:53PM +0100, Andres Freund wrote:
> To make an actually usable patch out of this it seems we'd have to add a
> 'partial' argument to grab_cache_page_write_begin(), so writes to parts
> of a page still cause the pages to be marked active.  Is it preferrable
> to change all callers of grab_cache_page_write_begin and
> add_to_page_cache_lru or make them into wrapper functions, and call the
> real deal when it matters?

Personally, I'd prefer explicit arguments over another layer of
wrappers, especially in the add_to_page_cache family. But it's
possible others will disagree and only voice their opinion once you
went through the hassle and sent a patch.

> I do think that that's a reasonable algorithmic change, but nonetheless
> its obviously possible that such changes regress some workloads. What's
> the policy around testing such things?

How about a FGP_WRITE that only sets the page's referenced bit, but
doesn't activate or refault-activate the page?

That way, pages that are only ever written would never get activated,
but a single read mixed in would activate the page straightaway;
either in mark_page_accessed() or through refault-activation.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-12 19:35         ` Andres Freund
@ 2016-02-17 21:17           ` Rik van Riel
  -1 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2016-02-17 21:17 UTC (permalink / raw)
  To: Andres Freund; +Cc: Johannes Weiner, linux-mm, linux-kernel, Vlastimil Babka

On Fri, 12 Feb 2016 20:35:53 +0100
Andres Freund <andres@anarazel.de> wrote:

> On 2016-02-12 13:46:53 +0100, Andres Freund wrote:
> > I'm wondering why pages that are repeatedly written to, in units above
> > the page size, are promoted to the active list? I mean if there never
> > are any reads or re-dirtying an already-dirty page, what's the benefit
> > of moving that page onto the active list?
> 
> We chatted about this on IRC and you proposed testing this by removing
> FGP_ACCESSED in grab_cache_page_write_begin.  I ran tests with that,
> after removing the aforementioned code to issue posix_fadvise(DONTNEED)
> in postgres.

That looks promising.

> Here the active/inactive lists didn't change as much as I hoped. A bit
> of reading made it apparent that the workingset logic in
> add_to_page_cache_lru() defated that attempt,

The patch below should help with that.

Does the FGP_ACCESSED change still help with the patch
below applied?

If so, we can add the partial write logic everywhere,
and use it here, too.

---8<---

Subject: mm,workingset: only do workingset activations on reads

When rewriting a page, the data in that page is replaced with new
data. This means that evicting something else from the active file
list, in order to cache data that will be replaced by something else,
is likely to be a waste of memory.

It is better to save the active list for frequently read pages, because
reads actually use the data that is in the page.

This patch ignores partial writes, because it is unclear whether the
complexity of identifying those is worth any potential performance
gain obtained from better caching pages that see repeated partial
writes at large enough intervals to not get caught by the use-twice
promotion code used for the inactive file list.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Andres Freund <andres@anarazel.de>
---
 mm/filemap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bc943867d68c..1235d27b2c97 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -703,8 +703,12 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * The page might have been evicted from cache only
 		 * recently, in which case it should be activated like
 		 * any other repeatedly accessed page.
+		 * The exception is pages getting rewritten; evicting other
+		 * data from the working set, only to cache data that will
+		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (shadow && workingset_refault(shadow)) {
+		if (shadow && !(gfp_mask & __GFP_WRITE) &&
+					workingset_refault(shadow)) {
 			SetPageActive(page);
 			workingset_activation(page);
 		} else

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
@ 2016-02-17 21:17           ` Rik van Riel
  0 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2016-02-17 21:17 UTC (permalink / raw)
  To: Andres Freund; +Cc: Johannes Weiner, linux-mm, linux-kernel, Vlastimil Babka

On Fri, 12 Feb 2016 20:35:53 +0100
Andres Freund <andres@anarazel.de> wrote:

> On 2016-02-12 13:46:53 +0100, Andres Freund wrote:
> > I'm wondering why pages that are repeatedly written to, in units above
> > the page size, are promoted to the active list? I mean if there never
> > are any reads or re-dirtying an already-dirty page, what's the benefit
> > of moving that page onto the active list?
> 
> We chatted about this on IRC and you proposed testing this by removing
> FGP_ACCESSED in grab_cache_page_write_begin.  I ran tests with that,
> after removing the aforementioned code to issue posix_fadvise(DONTNEED)
> in postgres.

That looks promising.

> Here the active/inactive lists didn't change as much as I hoped. A bit
> of reading made it apparent that the workingset logic in
> add_to_page_cache_lru() defated that attempt,

The patch below should help with that.

Does the GFP_ACCESSED change still help with the patch
below applied?

If so, we can add the partial write logic everywhere,
and use it here, too.

---8<---

Subject: mm,workingset: only do workingset activations on reads

When rewriting a page, the data in that page is replaced with new
data. This means that evicting something else from the active file
list, in order to cache data that will be replaced by something else,
is likely to be a waste of memory.

It is better to save the active list for frequently read pages, because
reads actually use the data that is in the page.

This patch ignores partial writes, because it is unclear whether the
complexity of identifying those is worth any potential performance
gain obtained from better caching pages that see repeated partial
writes at large enough intervals to not get caught by the use-twice
promotion code used for the inactive file list.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Andres Freund <andres@anarazel.de>
---
 mm/filemap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bc943867d68c..1235d27b2c97 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -703,8 +703,12 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * The page might have been evicted from cache only
 		 * recently, in which case it should be activated like
 		 * any other repeatedly accessed page.
+		 * The exception is pages getting rewritten; evicting other
+		 * data from the working set, only to cache data that will
+		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (shadow && workingset_refault(shadow)) {
+		if (shadow && !(gfp_mask & GFP_WRITE) &&
+					workingset_refault(shadow)) {
 			SetPageActive(page);
 			workingset_activation(page);
 		} else

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-12 20:24       ` Johannes Weiner
@ 2016-02-19 22:07         ` Andres Freund
  -1 siblings, 0 replies; 22+ messages in thread
From: Andres Freund @ 2016-02-19 22:07 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Rik van Riel, linux-mm, linux-kernel, Vlastimil Babka

Hi Johannes,

On 2016-02-12 15:24:05 -0500, Johannes Weiner wrote:
> I've updated the patch to work with cgroups.

Are tests of this patch, as opposed to the earlier version, necessary?
If so, what is this patch based on? It doesn't seem to apply cleanly
to any tree I know of. The conflicts aren't hard to resolve, mind you,
but ...

- Andres

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Unhelpful caching decisions, possibly related to active/inactive sizing
  2016-02-17 21:17           ` Rik van Riel
@ 2016-02-19 22:19             ` Andres Freund
  -1 siblings, 0 replies; 22+ messages in thread
From: Andres Freund @ 2016-02-19 22:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Johannes Weiner, linux-mm, linux-kernel, Vlastimil Babka

On 2016-02-17 16:17:44 -0500, Rik van Riel wrote:
> On Fri, 12 Feb 2016 20:35:53 +0100
> Andres Freund <andres@anarazel.de> wrote:
> 
> > On 2016-02-12 13:46:53 +0100, Andres Freund wrote:
> > > I'm wondering why pages that are repeatedly written to, in units above
> > > the page size, are promoted to the active list? I mean if there never
> > > are any reads or re-dirtying an already-dirty page, what's the benefit
> > > of moving that page onto the active list?
> > 
> > We chatted about this on IRC and you proposed testing this by removing
> > FGP_ACCESSED in grab_cache_page_write_begin.  I ran tests with that,
> > after removing the aforementioned code to issue posix_fadvise(DONTNEED)
> > in postgres.
> 
> That looks promising.

Indeed.


> > Here the active/inactive lists didn't change as much as I hoped. A bit
> > of reading made it apparent that the workingset logic in
> > add_to_page_cache_lru() defeated that attempt,
> 
> The patch below should help with that.
> 
> Does the FGP_ACCESSED change still help with the patch
> below applied?

I've not yet run any tests, but I'd earlier used perf probes to see
where pages got activated, and I saw activations from both places. So
presumably there'd be a difference; i.e. ISTM we need to change both
places.


Regards,

Andres

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-02-19 22:19 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-09 16:52 Unhelpful caching decisions, possibly related to active/inactive sizing Andres Freund
2016-02-09 22:42 ` Johannes Weiner
2016-02-11 20:34   ` Rik van Riel
2016-02-12 12:46     ` Andres Freund
2016-02-12 19:35       ` Andres Freund
2016-02-16 19:29         ` Johannes Weiner
2016-02-17 21:17         ` Rik van Riel
2016-02-19 22:19           ` Andres Freund
2016-02-12 12:56     ` Andres Freund
2016-02-12 20:24     ` Johannes Weiner
2016-02-19 22:07       ` Andres Freund
