* Unhelpful caching decisions, possibly related to active/inactive sizing
From: Andres Freund @ 2016-02-09 16:52 UTC
  To: Johannes Weiner, Rik van Riel; +Cc: linux-mm, linux-kernel, Vlastimil Babka

Hi,

I'm working on fixing long IO stalls with postgres. After some
architectural changes fixing the worst issues, I noticed that individual
processes/backends/connections still spend more time waiting than I'd
expect.

In a workload with the hot data set fitting into memory (2GB of
mmap(HUGE|ANON) shared memory for the postgres buffer cache, ~6GB of
dataset, 16GB total memory) I found that there are more reads hitting
disk than I'd expect.  That's after I've led Vlastimil on IRC down a
wrong rabbit hole, sorry for that.

Some tinkering and questions later, the issue appears to be postgres'
journal/WAL, which in the test setup is write-only and only touched
again when individual segments of the WAL are reused. In the
configuration I'm using, that only happens after ~20min and ~30GB of
WAL writes or so.
Drastically reducing the volume of WAL through some (unsafe)
configuration options, or forcing the WAL to be written using O_DIRECT,
changes the workload to be fully cached.

Rik asked me about active/inactive sizing in /proc/meminfo:
Active:          7860556 kB
Inactive:        5395644 kB
Active(anon):    2874936 kB
Inactive(anon):   432308 kB
Active(file):    4985620 kB
Inactive(file):  4963336 kB
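
As a quick way to read these figures, here is a minimal sketch that
parses the snippet above and computes what share of the file-backed
page cache sits on the active list. The function names are made up for
illustration; only the numbers come from the output above. For these
values the share comes out very close to one half.

```python
# Sketch only: parse the /proc/meminfo excerpt quoted above and compute
# the active share of the file-backed LRU lists. Helper names are
# hypothetical, not part of any existing tool.
MEMINFO = """\
Active:          7860556 kB
Inactive:        5395644 kB
Active(anon):    2874936 kB
Inactive(anon):   432308 kB
Active(file):    4985620 kB
Inactive(file):  4963336 kB
"""


def parse_meminfo(text):
    """Return a dict mapping field name to its value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.strip():
            fields[name.strip()] = int(rest.split()[0])
    return fields


def active_file_share(fields):
    """Fraction of file-backed LRU pages that are on the active list."""
    active = fields["Active(file)"]
    inactive = fields["Inactive(file)"]
    return active / (active + inactive)


if __name__ == "__main__":
    share = active_file_share(parse_meminfo(MEMINFO))
    print(f"active share of file LRU: {share:.3f}")
```

On a live system the same parser could be fed the contents of
/proc/meminfo instead of the embedded string.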

and then said:

riel   | the workingset stuff does not appear to be taken into account for active/inactive list sizing, in vmscan.c
riel   | I suspect we will want to expand the vmscan.c code, to take the workingset stats into account
riel   | when we re-fault a page that was on the active list before, we want to grow the size of the active list (and
       | shrink from inactive)
riel   | when we re-fault a page that was never active, we need to grow the size of the inactive list (and shrink
       | active)
riel   | but I don't think we have any bits free in page flags for that, we may need to improvise something :)
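
The rule Rik describes here can be sketched as a small feedback loop.
This is only an illustration of the idea from the chat, not kernel
code: the class name, the step size, the starting point of half the
capacity, and the assumption that a refault can be told whether the
page "was active" are all hypothetical (the chat itself notes there may
be no free page flag to record that).

```python
# Sketch only: adjust a target size for the active list based on
# refaults, following the rule from the chat. All names and the step
# size are assumptions made for illustration.
class ActiveListTarget:
    def __init__(self, capacity, step=1):
        self.capacity = capacity          # total pages across both lists
        self.step = step                  # pages to shift per refault
        self.active_target = capacity // 2

    def on_refault(self, was_active):
        """Shift the split after a page that had been evicted faults back in."""
        if was_active:
            # The page used to be on the active list: grow active,
            # which implicitly shrinks inactive.
            self.active_target = min(self.capacity - 1,
                                     self.active_target + self.step)
        else:
            # The page was never active: grow inactive instead.
            self.active_target = max(1, self.active_target - self.step)

    @property
    def inactive_target(self):
        return self.capacity - self.active_target


if __name__ == "__main__":
    target = ActiveListTarget(capacity=1000, step=10)
    for _ in range(5):
        target.on_refault(was_active=True)
    print(target.active_target, target.inactive_target)
```

The two targets always add up to the capacity, so growing one list is
the same operation as shrinking the other.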

andres | Ok, at this point I'm kinda out of my depth here ;)

riel   | andres: basically active & inactive file LRUs are kept at the same size currently
riel   | andres: which means anything that overflows half of memory will get flushed out of the cache by large write
       | volumes (to the write-only log)
riel   | andres: what we should do is dynamically size the active & inactive file lists, depending on which of the two
       | needs more caching
riel   | andres: if we never re-use the inactive pages that get flushed out, there's no sense in caching more of them
       | (and we could dedicate more memory to the active list, instead)
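
The effect Rik describes can be reproduced with a toy two-list cache.
This is a simplified model under stated assumptions, not the algorithm
in vmscan.c: new pages enter the inactive list, a second access promotes
a page to the active list, the active list is capped at a fixed fraction
of the cache, and eviction always takes the oldest inactive page. The
workload (a hot set of 700 pages in a 1000-page cache, with four
write-once "log" pages per hot read) and all names are invented for
illustration.

```python
# Sketch only: a toy active/inactive cache, used to compare a 50% cap
# on the active list with a larger cap under a write-once log stream.
import random
from collections import OrderedDict


def simulate(active_fraction, capacity=1000, hot_pages=700,
             log_writes_per_read=4, steps=100_000, seed=0):
    """Return the hit rate of hot-page reads for a given active-list cap."""
    rng = random.Random(seed)
    active_max = int(capacity * active_fraction)
    active = OrderedDict()    # oldest entry first
    inactive = OrderedDict()  # oldest entry first

    def touch(page):
        """Access one page; return True if it was already cached."""
        if page in active:
            active.move_to_end(page)
            return True
        if page in inactive:
            # Second access: promote, demoting the oldest active page
            # if the active list is over its cap.
            del inactive[page]
            active[page] = None
            if len(active) > active_max:
                demoted, _ = active.popitem(last=False)
                inactive[demoted] = None
            return True
        # Miss: new pages start on the inactive list.
        inactive[page] = None
        if len(active) + len(inactive) > capacity:
            inactive.popitem(last=False)  # evict the oldest inactive page
        return False

    hits = 0
    next_log_page = 0
    for _ in range(steps):
        if touch(("hot", rng.randrange(hot_pages))):
            hits += 1
        for _ in range(log_writes_per_read):
            touch(("log", next_log_page))  # written once, never reread
            next_log_page += 1
    return hits / steps


if __name__ == "__main__":
    for fraction in (0.5, 0.9):
        print(f"active cap {fraction:.0%}: "
              f"hot hit rate {simulate(fraction):.3f}")
```

In this model the part of the hot set that does not fit under the 50%
cap has to live on the inactive list, where the log stream pushes it out
before it is read again; with a larger cap the whole hot set can settle
on the active list and the hot hit rate is higher.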

andres | Sounds sensible. I guess things get really tricky if there's a portion of the inactive list that does get
       | reused (say if the hot data set is larger than memory), and another doesn't get reused at all.

I promised to send an email about the issue...

I can provide you with a branch of postgres + instructions to reproduce
the issue, or I can test patches, whatever you prefer.

This test was run using 4.5.0-rc2, but I doubt this is a recent
regression or such.

Any other information I can provide you with?

Regards,

Andres


