* Sudden and massive page cache eviction
@ 2010-11-12 16:20 Peter Schüller
  2010-11-23  0:11   ` Andrew Morton
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Schüller @ 2010-11-12 16:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: Mattias de Zalenski

Hello,

We have been seeing sudden and repeated evictions of huge amounts of
page cache on some of our servers for reasons that we cannot explain.
We are hoping that someone familiar with the vm subsystem may be able
to shed some light on the issue and perhaps confirm whether it is
plausibly a kernel bug or not. I will try to present the information
most-important-first, but this post will unavoidably be a bit long -
sorry.

First, here is a good example of the symptom (more graphs later on):

   http://files.spotify.com/memcut/b_daily_allcut.png

After looking into this we have seen similar incidents on servers
running completely different software; but in this particular case
this machine is running a service which is heavily dependent on the
buffer cache to deal with incoming request load. The direct effect of
these evictions is complete I/O saturation (average queue depth goes
to 150-250 and stays there indefinitely, or until we actively
intervene, e.g. by warming up caches). Our interpretation is
that the eviction is not the result of something along the lines of a
large file being removed; given the effects on I/O load it is clear
that the data being evicted is in fact part of the active set used by
the service running on the machine.
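
(Queue depth as observable with e.g. iostat - the interval here is
arbitrary:

   iostat -x 5    # avgqu-sz column = average request queue depth

)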

The I/O load on these systems comes mainly from three things:

  (1) Seek-bound I/O generated by lookups in a BDB (b-tree traversal).
  (2) Seek-bound I/O generated by traversal of prefix directory trees
(i.e., 00/01/0001334234...., a poor man's b-tree on top of ext3).
  (3) Seek-bound I/O reading small segments of small-to-medium sized
files contained in the prefix tree.

The prefix tree consists of 8*2^16 directory entries in total, with
individual files numbering in the tens of millions per server.
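
(For illustration, the path scheme amounts to something like the
following - the key and root directory here are hypothetical, the
real layout differs only in detail:

   # two levels of prefix directories derived from the object's hex key
   key=0001334234abcdef
   path="/srv/store/${key:0:2}/${key:2:2}/${key}"   # -> .../00/01/0001334234abcdef

)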

We initially ran 2.6.32-bpo.5-amd64 (Debian backports kernel) and have
subsequently upgraded some of them to 2.6.36-rc6-amd64 (Debian
experimental repo). While the newer kernel initially seemed to behave
better, over time it made no difference (perhaps as a function of
uptime, but we have not had the opportunity to test this by rebooting
some of the machines, so it is an untested hypothesis).

Most of the activity on this system (ignoring the usual stuff like
ssh/cron/syslog/etc) is coming from Python processes that consume
non-trivial amounts of heap space, plus the disk activity and some
POSIX shared memory caching utilized by the BDB library.

We have correlated the incidence of these page evictions with higher
load on the system; i.e., they tend to happen during high-load
periods, and in addition we tend to see additional machines having
problems as a result of us "fixing" a machine that experienced an
eviction (we have some limited cascading effects that cause slightly
higher load on other servers in the cluster when we do that).

We believe the most plausible way an application bug could trigger
this behavior is if (1) the application allocates the memory, and (2)
actually touches the pages. We believe this to be unlikely in this
case because:

  (1) We see similar sudden evictions on various other servers, which
we noticed when we started looking for them.
  (2) The fact that it tends to trigger in correlation with load
suggests that it is not a functional bug in the service, since higher
load is in this case unlikely to exercise any code paths that do
anything unique with respect to memory allocation - in particular
because the domain logic is all Python, and none of it really deals
with data chunks.
  (3) If we did manage to allocate something in the Python heap, we
would have to be "lucky" (or unlucky) for Python to consistently be
able to munmap()/brk() back down afterwards.

Some additional "sample" graphs showing a few incidences of the problem:

   http://files.spotify.com/memcut/a_daily.png
   http://files.spotify.com/memcut/a_weekly.png
   http://files.spotify.com/memcut/b_daily_allcut.png
   http://files.spotify.com/memcut/c_monthly.png
   http://files.spotify.com/memcut/c_yearly.png
   http://files.spotify.com/memcut/d_monthly.png
   http://files.spotify.com/memcut/d_yearly.png
   http://files.spotify.com/memcut/a_monthly.png
   http://files.spotify.com/memcut/a_yearly.png
   http://files.spotify.com/memcut/c_daily.png
   http://files.spotify.com/memcut/c_weekly.png
   http://files.spotify.com/memcut/d_daily.png
   http://files.spotify.com/memcut/d_weekly.png

And here is an example from a server only running PostgreSQL (where
the sudden drop of gigabytes of page cache is unlikely because we are
not DROP:ing tables, nor do we have multi-gigabyte WAL archive sizes,
nor do we have a use case that would imply ftruncate() on table
files):

   http://files.spotify.com/memcut/postgresql_weekly.png

As you can see it is not as significant there, but visually it seems
to be the same "type" of effect. We have seen similar drops on various
machines, although depending on the service running, they may or may
not be explainable by regular file removal.

Further, we have observed the kernel's unwillingness to retain data in
page cache under interesting circumstances:

(1) page cache eviction happens
(2) we warm up our BDB files by cat:ing them (simple but effective;
see the sketch after this list)
(3) within a matter of minutes, while there are still several GB of
truly free memory (free, not page cache), the files are evicted again
(as evidenced by re-cat:ing them a little while later)
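
For concreteness, the warm-up amounts to roughly this (path
hypothetical; we simply read the files to pull them back into cache):

   cat /srv/store/*.bdb > /dev/null   # read everything, discard output
   free -g                            # then watch how much stays cached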

We understand this last observation may be due to NUMA-related
allocation issues, and we should probably try numactl to ask for a
more even allocation. We have not yet tried this. However, it is not
clear how any issue of that kind would cause sudden eviction of data
already *in* the page cache (on whichever node).
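
(If we do try it, presumably something along these lines, with our
actual service substituted in - untested:

   numactl --interleave=all /usr/bin/our-service   # interleave across nodes

)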

-- 
/ Peter Schuller aka scode


* Re: Sudden and massive page cache eviction
  2010-11-12 16:20 Sudden and massive page cache eviction Peter Schüller
@ 2010-11-23  0:11   ` Andrew Morton
  0 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2010-11-23  0:11 UTC (permalink / raw)
  To: Peter Schüller; +Cc: linux-kernel, Mattias de Zalenski, linux-mm

(cc linux-mm)

On Fri, 12 Nov 2010 17:20:21 +0100
Peter Schüller <scode@spotify.com> wrote:

> [original report quoted in full - trimmed]



* Re: Sudden and massive page cache eviction
  2010-11-23  0:11   ` Andrew Morton
@ 2010-11-23  8:38     ` Dave Hansen
  -1 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2010-11-23  8:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Schüller, linux-kernel, Mattias de Zalenski, linux-mm

On Mon, 2010-11-22 at 16:11 -0800, Andrew Morton wrote:
> > We understand this last observation may be due to NUMA-related
> > allocation issues, and we should probably try numactl to ask for a
> > more even allocation. We have not yet tried this. However, it is
> > not clear how any issue of that kind would cause sudden eviction
> > of data already *in* the page cache (on whichever node).

You don't have anybody messing with /proc/sys/vm/drop_caches, do you?

That can cause massive, otherwise unprovoked page cache eviction.  
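
For reference, per Documentation/sysctl/vm.txt: writing 1 drops clean
page cache, 2 drops dentries and inodes, 3 drops both:

	sync                               # flush dirty pages first
	echo 1 > /proc/sys/vm/drop_caches  # drop clean page cache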

-- Dave



* Re: Sudden and massive page cache eviction
  2010-11-23  8:38     ` Dave Hansen
@ 2010-11-23  9:44       ` Peter Schüller
  -1 siblings, 0 replies; 31+ messages in thread
From: Peter Schüller @ 2010-11-23  9:44 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

> You don't have anybody messing with /proc/sys/vm/drop_caches, do you?

Highly unlikely, given that (1) the evictions, while often very
significant, are usually not *complete* (although the first graph
example I provided showed a more or less complete eviction), (2) the
evictions are not obviously periodic in a way that would suggest a
cron job, and (3) we see the evictions happening across a wide variety
of machines.

So yes, I feel confident that we are not accidentally doing that.

(FWIW though, drop_caches is great. I only recently found out about
it, and it's really useful when benchmarking.)

-- 
/ Peter Schuller aka scode


* Re: Sudden and massive page cache eviction
  2010-11-23  9:44       ` Peter Schüller
@ 2010-11-23 16:19         ` Dave Hansen
  -1 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2010-11-23 16:19 UTC (permalink / raw)
  To: Peter Schüller
  Cc: Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

On Tue, 2010-11-23 at 10:44 +0100, Peter Schüller wrote:
> > You don't have anybody messing with /proc/sys/vm/drop_caches, do you?
> 
> Highly unlikely, given that (1) the evictions, while often very
> significant, are usually not *complete* (although the first graph
> example I provided showed a more or less complete eviction), (2) the
> evictions are not obviously periodic in a way that would suggest a
> cron job, and (3) we see the evictions happening across a wide
> variety of machines.
> 
> So yes, I feel confident that we are not accidentally doing that.

Yeah, drop_caches doesn't seem very likely.

Your postgres data looks the cleanest and is probably the easiest to
analyze.  Might as well start there:

	http://files.spotify.com/memcut/postgresql_weekly.png

As you said, it might not be the same as the others, but it's a decent
place to start.  If someone used drop_caches or if someone was randomly
truncating files, we'd expect to see the active/inactive lines both drop
by relatively equivalent amounts, and see them happen at _exactly_ the
same time as the cache eviction.  The eviction about 1/3 of the way
through Wednesday in the above graph kinda looks this way, but it's the
exception.

Just eyeballing it, _most_ of the evictions seem to happen after some
movement in the active/inactive lists.  We see an "inactive" uptick as
we start to launder pages, and the page activation doesn't keep up with
it.  This is a _bit_ weird since we don't see any slab cache or other
users coming to fill the new space.  Something _wanted_ the memory, so
why isn't it being used?

Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
allocations happening in the kernel?  

If you could start recording /proc/{vmstat,buddyinfo,meminfo,slabinfo},
it would be immensely useful.  The munin graphs are really great, but
they don't have the detail which you can get from stuff like vmstat.
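
Something as simple as this loop would do (destination directory just
an example):

	# snapshot the VM counters every 60 seconds, timestamped
	mkdir -p /var/log/vmsnap
	while true; do
		t=$(date +%s)
		for f in vmstat buddyinfo meminfo slabinfo; do
			cat /proc/$f > /var/log/vmsnap/$f.$t
		done
		sleep 60
	done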

> Further, we have observed the kernel's unwillingness to retain data in
> page cache under interesting circumstances:
> 
> (1) page cache eviction happens
> (2) we warm up our BDB files by cat:ing them (simple but effective)
> (3) within a matter of minutes, while there are still several GB of
> truly free memory (free, not page cache), the files are evicted again
> (as evidenced by re-cat:ing them a little while later)
>
> We understand this last observation may be due to NUMA-related
> allocation issues, and we should probably try numactl to ask for a
> more even allocation. We have not yet tried this. However, it is not
> clear how any issue of that kind would cause sudden eviction of data
> already *in* the page cache (on whichever node).

For a page-cache-heavy workload where you care a lot more about things
being _in_ cache rather than having good NUMA locality, you probably
want "zone_reclaim_mode" set to 0:

	http://www.kernel.org/doc/Documentation/sysctl/vm.txt
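
In sysctl terms (assuming the usual layout):

	sysctl -w vm.zone_reclaim_mode=0
	# equivalently: echo 0 > /proc/sys/vm/zone_reclaim_mode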

That'll be a bit more comprehensive than messing with numactl.  It
really is the best thing if you just don't care about NUMA latencies all
that much.  What kind of hardware is this, btw?

-- Dave



* Re: Sudden and massive page cache eviction
  2010-11-23 16:19         ` Dave Hansen
@ 2010-11-24 14:02           ` Peter Schüller
  -1 siblings, 0 replies; 31+ messages in thread
From: Peter Schüller @ 2010-11-24 14:02 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

Hello,

first of all, thank you very much for taking the time to analyze the situation!

> Yeah, drop_caches doesn't seem very likely.
>
> Your postgres data looks the cleanest and is probably the easiest to
> analyze.  Might as well start there:
>
>        http://files.spotify.com/memcut/postgresql_weekly.png

Since you wanted to look at that one primarily, I revisited that
particular case to confirm what was really happening. Unfortunately I
have to retract my claim here: it turns out that we write backups
locally on the machine before shipping them away, and the cache
evictions on that machine are indeed correlated with the removal of
said backup after it has been shipped off (testing confirms the
behavior).

Of course that is entirely expected (removal of a recently written
file causing a sudden spike in free memory), so the PostgreSQL graph
is a red herring.

This was unfortunate, and a result of me picking that graph fairly
ad hoc for the purpose of summarizing the situation in my post to the
list. We have spent considerable time making sure that the evictions
are indeed anomalous and not e.g. due to a file removal in the case of
the actual service where we are negatively affected, but I was not
sufficiently careful before proclaiming that we seem to see similar
effects on other hosts.

It may still be the case, but I am not finding an example which is
sufficiently clear at this time (being sure requires really looking at
what is going on with each class of machine and eliminating various
forms of backups, log rotation and other behavior). This also means
that in general it is not certain that we are in fact seeing this
behavior on other hosts at all.

However, all other observations still stand: the very direct
correlation in time with load spikes, the lack of correlation with
backup jobs, and the fact that the eviction seems to be of actively
used data given the resulting I/O storm. So I feel confident in saying
that we definitely do have an actual issue (although, as previously
indicated, we have not proven conclusively that there is no userspace
application suddenly allocating and touching lots of pages, but it
seems very unlikely).

> Just eyeballing it, _most_ of the evictions seem to happen after some
> movement in the active/inactive lists.  We see an "inactive" uptick as
> we start to launder pages, and the page activation doesn't keep up with
> it.  This is a _bit_ weird since we don't see any slab cache or other
> users coming to fill the new space.  Something _wanted_ the memory, so
> why isn't it being used?

In this case we have the writing of a backup file (with the
corresponding page cache population from reading the data). This is
followed by a period of reading the recently written file, and finally
its removal.

> Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
> allocations happening in the kernel?

No; we're not using huge pages at all (not consciously). Looking at
/proc/meminfo I can confirm that we just see this:

HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

So hopefully that should not be a factor.

> If you could start recording /proc/{vmstat,buddystat,meminfo,slabinfo},
> it would be immensely useful.  The munin graphs are really great, but
> they don't have the detail which you can get from stuff like vmstat.

Absolutely. I'll set up recording of those and let it run long enough
to correlate with the page evictions.

> For a page-cache-heavy workload where you care a lot more about things
> being _in_ cache rather than having good NUMA locality, you probably
> want "zone_reclaim_mode" set to 0:
>
>        http://www.kernel.org/doc/Documentation/sysctl/vm.txt
>
> That'll be a bit more comprehensive than messing with numactl.  It
> really is the best thing if you just don't care about NUMA latencies all
> that much.

Thanks! That looks to be exactly what we would like in this case and,
if I interpret you and the documentation correctly, it obviates the
need to ask for interleaved allocation.

>  What kind of hardware is this, btw?

It varies somewhat in age, but all of them are Intel. The oldest ones
have 16 GB of memory and are of this CPU type:

cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz
stepping	: 6
cpu MHz		: 2494.168
cache size	: 6144 KB

While newer ones have ~36 GB memory and are of this CPU type:

cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
stepping	: 5
cpu MHz		: 2000.049
cache size	: 4096 KB

Some variation beyond that may exist, but that is the span (and all
are Intel, 8 cores or more).

numactl --show on older machines:

policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0
nodebind: 0
membind: 0

And on newer machines:

policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1

-- 
/ Peter Schuller aka scode


* Re: Sudden and massive page cache eviction
  2010-11-24 14:02           ` Peter Schüller
@ 2010-11-24 14:14             ` Peter Schüller
  -1 siblings, 0 replies; 31+ messages in thread
From: Peter Schüller @ 2010-11-24 14:14 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

>> Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
>> allocations happening in the kernel?

I forgot to address the second part of this question: How would I best
inspect whether the kernel is doing that?

Looking at the kmalloc() sizes from vmstat -m I have the following on
one of the machines (so very few allocations larger than 4096 bytes).
But I suspect you are asking for something different?

kmalloc-8192                 52     56   8192      4
kmalloc-4096              33927  62040   4096      8
kmalloc-2048                338    416   2048     16
kmalloc-1024              76211 246976   1024     32
kmalloc-512                1134   1216    512     32
kmalloc-256              109523 324928    256     32
kmalloc-128                3902   4288    128     32
kmalloc-64               105296 105536     64     64
kmalloc-32                 2120   2176     32    128
kmalloc-16                 4607   4608     16    256
kmalloc-8                  6655   6656      8    512
kmalloc-192                6546   9030    192     21
kmalloc-96                29694  32298     96     42

-- 
/ Peter Schuller aka scode


* Re: Sudden and massive page cache eviction
  2010-11-24 14:14             ` Peter Schüller
@ 2010-11-24 14:20               ` Pekka Enberg
  -1 siblings, 0 replies; 31+ messages in thread
From: Pekka Enberg @ 2010-11-24 14:20 UTC (permalink / raw)
  To: Peter Schüller
  Cc: Dave Hansen, Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

Hi Peter,

2010/11/24 Peter Schüller <scode@spotify.com>:
>>> Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
>>> allocations happening in the kernel?
>
> I forgot to address the second part of this question: How would I best
> inspect whether the kernel is doing that?

You can, for example, record

  cat /proc/meminfo | grep Huge

for large page allocations.

> Looking at the kmalloc() sizes from vmstat -m I have the following on
> one of the machines (so very few allocations larger than 4096 bytes).
> But I suspect you are asking for something different?

The "pagesperslab" column of /proc/slabinfo tells you how many pages
slab allocates from the page allocator.

                       Pekka


* Re: Sudden and massive page cache eviction
  2010-11-24 14:20               ` Pekka Enberg
@ 2010-11-24 15:32                 ` Peter Schüller
  -1 siblings, 0 replies; 31+ messages in thread
From: Peter Schüller @ 2010-11-24 15:32 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Dave Hansen, Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

>> I forgot to address the second part of this question: How would I best
>> inspect whether the kernel is doing that?
>
> You can, for example, record
>
>  cat /proc/meminfo | grep Huge
>
> for large page allocations.

Those show zero as per my other post. However, I got the impression
Dave was asking about regular but larger-than-one-page allocations
internal to the kernel, while the Huge* lines in /proc/meminfo refer
to allocations specifically done by userland applications doing huge
page allocation on a system with huge pages enabled - or am I
confused?

> The "pagesperslab" column of /proc/slabinfo tells you how many pages
> slab allocates from the page allocator.

Seems to be what vmstat -m reports.

-- 
/ Peter Schuller aka scode


* Re: Sudden and massive page cache eviction
  2010-11-24 14:14             ` Peter Schüller
@ 2010-11-24 17:32               ` Dave Hansen
  -1 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2010-11-24 17:32 UTC (permalink / raw)
  To: Peter Schüller
  Cc: Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

On Wed, 2010-11-24 at 15:14 +0100, Peter Schüller wrote:
> >> Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
> >> allocations happening in the kernel?
> 
> I forgot to address the second part of this question: How would I best
> inspect whether the kernel is doing that? 

I found out yesterday how to do it with tracing, but it's not a horribly
simple thing to do in any case.  You can watch the entries in slabinfo
and see if any of the ones with sizes over 4096 bytes are getting used
often.  You can also watch /proc/buddyinfo and see how often columns
other than the first couple are moving around.
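
The per-zone columns in /proc/buddyinfo are free-block counts for
orders 0 through 10, left to right, so movement from the fourth count
onward is order >= 3 activity:

	watch -n1 cat /proc/buddyinfo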

Jumbo ethernet frames would be the most common reason to see these
allocations.  It's _probably_ not an issue in your case.

-- Dave



* Re: Sudden and massive page cache eviction
  2010-11-24 15:32                 ` Peter Schüller
@ 2010-11-24 17:46                   ` Pekka Enberg
  -1 siblings, 0 replies; 31+ messages in thread
From: Pekka Enberg @ 2010-11-24 17:46 UTC (permalink / raw)
  To: Peter Schüller
  Cc: Dave Hansen, Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

Hi Peter,

2010/11/24 Peter Schüller <scode@spotify.com>:
>>> I forgot to address the second part of this question: How would I best
>>> inspect whether the kernel is doing that?
>>
>> You can, for example, record
>>
>>  cat /proc/meminfo | grep Huge
>>
>> for large page allocations.
>
> Those show zero a per my other post. However I got the impression Dave
> was asking about regular but larger-than-one-page allocations internal
> to the kernel, while the Huge* lines in /proc/meminfo refers to
> allocations specifically done by userland applications doing huge page
> allocation on a system with huge pages enabled - or am I confused?

He was asking about both (large page allocations and higher order allocations).

>> The "pagesperslab" column of /proc/slabinfo tells you how many pages
>> slab allocates from the page allocator.
>
> Seems to be what vmstat -m reports.

No, "vmstat -m" reports _total number_ of pages allocated. We're
interested in how many pages slab allocator whenever it needs to
allocate memory for a new slab. That's represented by the
"pagesperslab" column of /proc/slabinfo from which you can deduce the
page allocation order.
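
Assuming the usual slabinfo version 2.1 layout (pagesperslab is the
sixth column), something like this lists the caches that need
higher-order pages:

  # caches whose slabs span more than one page; order = log2(pagesperslab)
  awk 'NR > 2 && $6 > 1 { print $1, "pagesperslab=" $6, "order=" log($6)/log(2) }' /proc/slabinfo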

                        Pekka


* Re: Sudden and massive page cache eviction
  2010-11-24 15:32                 ` Peter Schüller
@ 2010-11-25  1:18                   ` Simon Kirby
  -1 siblings, 0 replies; 31+ messages in thread
From: Simon Kirby @ 2010-11-25  1:18 UTC (permalink / raw)
  To: Peter Schüller
  Cc: Pekka Enberg, Dave Hansen, Andrew Morton, linux-kernel,
	Mattias de Zalenski, linux-mm

On Wed, Nov 24, 2010 at 04:32:39PM +0100, Peter Schüller wrote:

> >> I forgot to address the second part of this question: How would I best
> >> inspect whether the kernel is doing that?
> >
> > You can, for example, record
> >
> >  cat /proc/meminfo | grep Huge
> >
> > for large page allocations.
> 
> Those show zero as per my other post. However I got the impression Dave
> was asking about regular but larger-than-one-page allocations internal
> to the kernel, while the Huge* lines in /proc/meminfo refer to
> allocations specifically done by userland applications doing huge page
> allocation on a system with huge pages enabled - or am I confused?

Your page cache dents don't seem quite as big, so it may be something
else, but if it's the same problem we're seeing here, it seems to have to
do with when an order=3 new_slab allocation comes in to grow the kmalloc
slab cache for an __alloc_skb() (network packet).  This is normal even
without jumbo frames now.  When there is no zone for which order=3
zone_watermark_ok() passes, kswapd is woken, and it frees things all over
the place to try to get zone_watermark_ok(order=3) to be happy.

We're seeing this throw out a huge number of pages, and we're seeing it
happen even with lots of memory free in the zone.  CONFIG_COMPACTION also
currently does not help because try_to_compact_pages() returns early with
COMPACT_SKIPPED if order <= PAGE_ALLOC_COSTLY_ORDER, and, you guessed it,
PAGE_ALLOC_COSTLY_ORDER is set to 3.

I reimplemented zone_pages_ok(order=3) in userspace, and I can see it
happen:

Code here: http://0x.ca/sim/ref/2.6.36/buddyinfo_scroll

  Zone order:0      1     2     3    4 5 6 7 8 9 A nr_free state

 DMA32   19026  33652  4897    13    5 1 2 0 0 0 0  106262 337 <= 256
Normal     450      0     0     0    0 0 0 0 0 0 0     450 -7 <= 238
 DMA32   19301  33869  4665    12    5 1 2 0 0 0 0  106035 329 <= 256
Normal     450      0     0     0    0 0 0 0 0 0 0     450 -7 <= 238
 DMA32   19332  33931  4603     9    5 1 2 0 0 0 0  105918 305 <= 256
Normal     450      0     0     0    0 0 0 0 0 0 0     450 -7 <= 238
 DMA32   19467  34057  4468     6    5 1 2 0 0 0 0  105741 281 <= 256
Normal     450      0     0     0    0 0 0 0 0 0 0     450 -7 <= 238
 DMA32   19591  34181  4344     5    5 1 2 0 0 0 0  105609 273 <= 256
Normal     450      0     0     0    0 0 0 0 0 0 0     450 -7 <= 238
 DMA32   19856  34348  4109     2    5 1 2 0 0 0 0  105244 249 <= 256 !!!
Normal     450      0     0     0    0 0 0 0 0 0 0     450 -7 <= 238
 DMA32   24088  36476  5437   144    5 1 2 0 0 0 0  120180 1385 <= 256
Normal    1024      1     0     0    0 0 0 0 0 0 0    1026 -5 <= 238
 DMA32   26453  37440  6676   623   53 1 2 0 0 0 0  134029 5985 <= 256
Normal    8700    100     0     0    0 0 0 0 0 0 0    8900 193 <= 238
 DMA32   48881  38161  7142   966   81 1 2 0 0 0 0  162955 9177 <= 256
Normal    8936    102     0     1    0 0 0 0 0 0 0    9148 205 <= 238
 DMA32   66046  40051  7871  1409  135 2 2 0 0 0 0  191256 13617 <= 256
Normal    9019     18     0     0    0 0 0 0 0 0 0    9055 29 <= 238
 DMA32   67133  48671  8231  1578  143 2 2 0 0 0 0  212503 15097 <= 256

So, kswapd was woken up at the line that ends in "!!!" there, because
free_pages(249) <= min(256), and so zone_watermark_ok() returned 0, when
an order=3 allocation came in.

Maybe try out that script and see if you see something similar.

Simon-

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Sudden and massive page cache eviction
  2010-11-24 17:32               ` Dave Hansen
@ 2010-11-25 15:33                 ` Peter Schüller
  -1 siblings, 0 replies; 31+ messages in thread
From: Peter Schüller @ 2010-11-25 15:33 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

> simple thing to do in any case.  You can watch the entries in slabinfo
> and see if any of the ones with sizes over 4096 bytes are getting used
> often.  You can also watch /proc/buddyinfo and see how often columns
> other than the first couple are moving around.

I collected some information from
/proc/{buddyinfo,meminfo,slabinfo,vmstat} and let it sit, polling
approximately once per minute. I have some results correlated with
another page eviction in graphs. The graph is here:

   http://files.spotify.com/memcut/memgraph-20101124.png

The last sudden eviction there occurred somewhere between 22:30 and
22:45. Some URLs that can be compared for those periods:

   Before:
   http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/vmstat
   http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/buddyinfo
   http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/meminfo
   http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/slabinfo

   After:
   http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/vmstat
   http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/buddyinfo
   http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/meminfo
   http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/slabinfo

A more complete set of files for several minutes before/after/during
is in this tarball:

   http://files.spotify.com/memcut/memgraph-20101124.tar.gz

Diffing the two slabinfo dumps above yields, among other things, the
following expected (or at least plausible) decreases, since they
correlate with the symptoms:

-ext3_inode_cache  254638 693552    760   43    8 : tunables    0    0
   0 : slabdata  16130  16130      0
+ext3_inode_cache   81698 468958    760   43    8 : tunables    0    0
   0 : slabdata  10906  10906      0
-dentry             96504 344232    192   42    2 : tunables    0    0
   0 : slabdata   8196   8196      0
+dentry             55628 243810    192   42    2 : tunables    0    0
   0 : slabdata   5805   5805      0

And if my understanding is correct, buffer_head would be meta-data
associated with cached pages and thus be expected to drop in
correlation with less data cached:

-buffer_head       2109250 4979052    104   39    1 : tunables    0
0    0 : slabdata 127668 127668      0
+buffer_head       838859 4059822    104   39    1 : tunables    0
0    0 : slabdata 104098 104098      0

My knowledge of the implementation is far too lacking to know where
best to look for the likely culprit in terms of the root cause of the
eviction. The one thing I thought looked suspicious was the kmalloc
increases:

-kmalloc-4096         301    328   4096    8    8 : tunables    0    0
   0 : slabdata     41     41      0
+kmalloc-4096         637    680   4096    8    8 : tunables    0    0
   0 : slabdata     85     85      0
-kmalloc-2048       18215  19696   2048   16    8 : tunables    0    0
   0 : slabdata   1231   1231      0
+kmalloc-2048       41908  51792   2048   16    8 : tunables    0    0
   0 : slabdata   3237   3237      0
-kmalloc-1024       85444  97280   1024   32    8 : tunables    0    0
   0 : slabdata   3040   3040      0
+kmalloc-1024      267031 327104   1024   32    8 : tunables    0    0
   0 : slabdata  10222  10222      0
-kmalloc-512         1988   2176    512   32    4 : tunables    0    0
   0 : slabdata     68     68      0
+kmalloc-512         1692   2080    512   32    4 : tunables    0    0
   0 : slabdata     65     65      0
-kmalloc-256       102588 119776    256   32    2 : tunables    0    0
   0 : slabdata   3743   3743      0
+kmalloc-256       308470 370720    256   32    2 : tunables    0    0
   0 : slabdata  11585  11585      0
-kmalloc-128         8435   9760    128   32    1 : tunables    0    0
   0 : slabdata    305    305      0
+kmalloc-128         8524   9760    128   32    1 : tunables    0    0
   0 : slabdata    305    305      0
-kmalloc-64         96176 405440     64   64    1 : tunables    0    0
   0 : slabdata   6335   6335      0
+kmalloc-64         50001 352448     64   64    1 : tunables    0    0
   0 : slabdata   5507   5507      0

If my interpretation and understanding is correct, this indicates, for
example, a jump from ~3000 to ~10000 order-3 allocations resulting from
1 KiB kmalloc()s - meaning about 0.2 GiB (7000*4*8*1024/1024/1024). Add
the other caches and we get some more, but only a few hundred MiB in total.
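
(As a sanity check of that arithmetic - a sketch using the exact
slabdata counts from the diff above, 3040 -> 10222 slabs at 8 pages of
4 KiB each:)

echo $(( (10222 - 3040) * 8 * 4096 / 1048576 ))   # => 224 (MiB)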

Going by the hypothesis that we are seeing the same thing as reported
by Simon Kirby (I'll respond to that e-mail separately), the total
amount is (as far as I understand) not the important part; but the
non-trivial increase in order-3 allocations would be a consistent
observation, in that frequent order-3 allocations might be more likely
to trigger the behavior Simon reports.

I can do additional post-processing on the data that was dumped (such
as graphing it), but I'm not sure which parts would be the most
interesting for figuring out what is going on. Is there something else
I should be collecting, for that matter, beyond
vmstat/slabinfo/buddyinfo/meminfo?

-- 
/ Peter Schuller aka scode

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Sudden and massive page cache eviction
  2010-11-25  1:18                   ` Simon Kirby
@ 2010-11-25 15:59                     ` Peter Schüller
  -1 siblings, 0 replies; 31+ messages in thread
From: Peter Schüller @ 2010-11-25 15:59 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Pekka Enberg, Dave Hansen, Andrew Morton, linux-kernel,
	Mattias de Zalenski, linux-mm

> Your page cache dents don't seem quite as big, so it may be something
> else, but if it's the same problem we're seeing here, it seems to have to
> do with when an order=3 new_slab allocation comes in to grow the kmalloc
> slab cache for an __alloc_skb() (network packet).  This is normal even
> without jumbo frames now.  When there is no zone for which order=3
> zone_watermark_ok() passes, kswapd is woken, and it frees things all over
> the place to try to get zone_watermark_ok(order=3) to be happy.
> We're seeing this throw out a huge number of pages, and we're seeing it
> happen even with lots of memory free in the zone.

Is there some way to observe this directly (the amount evicted for low
watermark reasons)?

If not, would logging/summing the return value of balance_pgdat() in
kswapd() (mm/vmscan.c) be the way to accomplish this?

My understanding (and I am saying it just so that people can tell me
if I'm wrong) is that what you're saying implies that kswapd keeps
getting woken up in wakeup_kswapd() due to zone_watermark_ok() failing,
but kswapd()'s invocation of balance_pgdat() is unable to bring levels
above the low watermark, despite evicting large amounts of data while
trying?

> I reimplemented zone_pages_ok(order=3) in userspace, and I can see it
> happen:

(For the ML record/others: I believe that was meant to be
zone_watermark_ok(), not zone_pages_ok(). It's in mm/page_alloc.c)

> Code here: http://0x.ca/sim/ref/2.6.36/buddyinfo_scroll

[snip output]

> So, kswapd was woken up at the line that ends in "!!!" there, because
> free_pages(249) <= min(256), and so zone_watermark_ok() returned 0, when
> an order=3 allocation came in.
>
> Maybe try out that script and see if you see something similar.

Thanks! That looks great. I'll try to set up data collection where
this can be observed and then correlated with a graph and the
vmstat/slabinfo that I just posted, the next time we see an eviction.

(For the record, it triggers constantly on my desktop, but that is with
2.6.32 and I'm assuming it is due to differences in that kernel, so
I'm not bothering to investigate. It's not triggering constantly on the
2.6.36-rc6 kernel on the production system, and hopefully we can see
it trigger during the evictions.)

-- 
/ Peter Schuller aka scode

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Sudden and massive page cache eviction
  2010-11-25 15:59                     ` Peter Schüller
@ 2010-12-01  6:36                       ` Simon Kirby
  -1 siblings, 0 replies; 31+ messages in thread
From: Simon Kirby @ 2010-12-01  6:36 UTC (permalink / raw)
  To: Peter Schüller
  Cc: Pekka Enberg, Dave Hansen, Andrew Morton, linux-kernel,
	Mattias de Zalenski, linux-mm

Hello!  Sorry, I didn't see your email until now.

On Thu, Nov 25, 2010 at 04:59:25PM +0100, Peter Schüller wrote:

> > Your page cache dents don't seem quite as big, so it may be something
> > else, but if it's the same problem we're seeing here, it seems to have to
> > do with when an order=3 new_slab allocation comes in to grow the kmalloc
> > slab cache for an __alloc_skb() (network packet).  This is normal even
> > without jumbo frames now.  When there is no zone for which order=3
> > zone_watermark_ok() passes, kswapd is woken, and it frees things all over
> > the place to try to get zone_watermark_ok(order=3) to be happy.
> > We're seeing this throw out a huge number of pages, and we're seeing it
> > happen even with lots of memory free in the zone.
> 
> Is there some way to observe this directly (the amount evicted for low
> watermark reasons)?
> 
> If not, would logging/summing the return value of balance_pgdat() in
> kswapd() (mm/vmscan.c) be the way to accomplish this?

The way I could "see" it was with http://0x.ca/sim/ref/2.6.36/vmstat_dump
, which updates fast enough (if your terminal doesn't suck) to actually
see the 10 Hz wakeups and similar patterns.  pageoutrun goes up each time
balance_pgdat() passes the "loop_again" label.

Watching http://0x.ca/sim/ref/2.6.36/buddyinfo_dump at the same time
shows that while pageoutrun is increasing, Normal pages are being
allocated and freed all very quickly but not reaching the watermark,
and the other kswapd-related counters seem to show this.
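
(If you don't want the full scripts, a crude approximation - a sketch;
the exact counter names vary somewhat between kernel versions:)

while :; do
    grep -E '^(pageoutrun|allocstall|kswapd_steal|pgscan_kswapd)' /proc/vmstat
    sleep 0.1
    echo ---
done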

> My understanding (and I am saying it just so that people can tell me
> if I'm wrong) is that what you're saying implies that kswapd keeps
> getting woken up in wakeup_kswapd() due to zone_watermark_ok() failing,
> but kswapd()'s invocation of balance_pgdat() is unable to bring levels
> above the low watermark, despite evicting large amounts of data while
> trying?

Yes, though there are a couple of issues here.  See the "Free memory
never fully used, swapping" thread and Mel Gorman's comments/patches. 

His patches fix kswapd fighting the allocator for me, but I'm still
running into problems with what seems to be fragmentation making it
difficult for kswapd to meet the higher order watermarks after a few
days.  Even with everything in slub set to order 0 (except order 1 where
absolutely required), I still seem to end up with lots of free memory
after a week or so of normal load.

> (For the ML record/others: I believe that was meant to be
> zone_watermark_ok(), not zone_pages_ok(). It's in mm/page_alloc.c)

Yes :)

> > Code here: http://0x.ca/sim/ref/2.6.36/buddyinfo_scroll
> 
> [snip output]
> 
> > So, kswapd was woken up at the line that ends in "!!!" there, because
> > free_pages(249) <= min(256), and so zone_watermark_ok() returned 0, when
> > an order=3 allocation came in.
> >
> > Maybe try out that script and see if you see something similar.
> 
> Thanks! That looks great. I'll try to set up data collection where
> this can be observed and then correlated with a graph and the
> vmstat/slabinfo that I just posted, the next time we see an eviction.
> 
> (For the record, it triggers constantly on my desktop, but that is with
> 2.6.32 and I'm assuming it is due to differences in that kernel, so
> I'm not bothering to investigate. It's not triggering constantly on the
> 2.6.36-rc6 kernel on the production system, and hopefully we can see
> it trigger during the evictions.)

Well, nothing seems to care about higher order watermarks unless
something actually tries to allocate at them, so if you don't have
any order-3 allocations, it's unlikely that the order-3 watermark will
be met after you've filled free memory.  This isn't immediately obvious,
because zone_watermark_ok() works by subtracting free pages in _lower_
orders from the total free pages in a zone.  So, it's the free blocks in
the specified order and _bigger_ that matter in buddyinfo.  Also,
min_free_kbytes and lowmem_reserve_ratio prevent all of the buddies
from being split even if _only_ order 0 allocations have ever occurred.
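
(To make that concrete, here is a rough userspace re-check of that loop
against /proc/buddyinfo - a sketch only: it takes the zone "min"
watermark as an argument rather than parsing /proc/zoneinfo, and it
ignores the lowmem_reserve term:)

#!/bin/sh
# usage: watermark_check [order] [min_pages]
ORDER=${1:-3} MIN=${2:-256}
awk -v order="$ORDER" -v min="$MIN" '
$1 == "Node" {
    free = 0
    for (i = 5; i <= NF; i++)       # fields 5.. are nr_free per order
        free += $i * 2 ^ (i - 5)    # total free pages in the zone
    m = min; ok = (free > m)        # order-0 check (lowmem_reserve ignored)
    for (o = 0; o < order && ok; o++) {
        free -= $(o + 5) * 2 ^ o    # discount blocks too small to help
        m = int(m / 2)              # the kernel halves the mark per order
        if (free <= m) ok = 0
    }
    printf "%-8s order=%d free=%d mark=%d -> %s\n",
           $4, order, free, m, ok ? "ok" : "fail (kswapd would be woken)"
}' /proc/buddyinfo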

Anyway, how did it go?  Did you find anything interesting? :)

Simon-

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Sudden and massive page cache eviction
  2010-11-25 15:33                 ` Peter Schüller
@ 2010-12-01  9:15                   ` Simon Kirby
  -1 siblings, 0 replies; 31+ messages in thread
From: Simon Kirby @ 2010-12-01  9:15 UTC (permalink / raw)
  To: Peter Schüller
  Cc: Dave Hansen, Andrew Morton, linux-kernel, Mattias de Zalenski, linux-mm

On Thu, Nov 25, 2010 at 04:33:01PM +0100, Peter Schüller wrote:

> > simple thing to do in any case.  You can watch the entries in slabinfo
> > and see if any of the ones with sizes over 4096 bytes are getting used
> > often.  You can also watch /proc/buddyinfo and see how often columns
> > other than the first couple are moving around.
> 
> I collected some information from
> /proc/{buddyinfo,meminfo,slabinfo,vmstat} and let it sit, polling
> approximately once per minute. I have some results correlated with
> another page eviction in graphs. The graph is here:
> 
>    http://files.spotify.com/memcut/memgraph-20101124.png
> 
> The last sudden eviction there occurred somewhere between 22:30 and
> 22:45. Some URLs that can be compared for those periods:
> 
>    Before:
>    http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/vmstat
>    http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/buddyinfo
>    http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/meminfo
>    http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/slabinfo
> 
>    After:
>    http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/vmstat
>    http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/buddyinfo
>    http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/meminfo
>    http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/slabinfo

Disclaimer: I have no idea what I'm doing. :)

Your buddyinfo looks to be pretty low for order 3 and above, before and
after the sudden eviction, so my guess is that it's probably related to
the issues I'm seeing with fragmentation, but maybe not fighting between
zones, since you seem to have a larger Normal zone than DMA32.  (Not
sure, you didn't post /proc/zoneinfo).  Also, you seem to be on an
actual NUMA system, so other things are happening there, too.

If you have munin installed (it looks like you do), try enabling the
buddyinfo plugin available since munin 1.4.4.  It graphs the buddyinfo
data, so it could be lined up with the memory graphs (thanks Erik).

[snip]

> kmalloc increases:
> 
> -kmalloc-4096         301    328   4096    8    8 : tunables    0    0
>    0 : slabdata     41     41      0
> +kmalloc-4096         637    680   4096    8    8 : tunables    0    0
>    0 : slabdata     85     85      0
> -kmalloc-2048       18215  19696   2048   16    8 : tunables    0    0
>    0 : slabdata   1231   1231      0
> +kmalloc-2048       41908  51792   2048   16    8 : tunables    0    0
>    0 : slabdata   3237   3237      0
> -kmalloc-1024       85444  97280   1024   32    8 : tunables    0    0
>    0 : slabdata   3040   3040      0
> +kmalloc-1024      267031 327104   1024   32    8 : tunables    0    0
>    0 : slabdata  10222  10222      0

Note that all of the above are actually attempting order-3 allocations
first; see /sys/kernel/slab/kmalloc-1024/order, for instance.  The "8"
means "8 pages per slab", which means order 3 is the attempted allocation
size.
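
(For instance, to list the configured order of every kmalloc cache - a
sketch assuming SLUB, i.e. that /sys/kernel/slab exists:)

for c in /sys/kernel/slab/kmalloc-*; do
        printf '%-24s order=%s\n' "${c##*/}" "$(cat "$c/order")"
done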

I did the following on a system to test, but the free memory did not
actually improve.  It seems that even order-1 allocations alone are
enough to reclaim too much order 0.  Even a "while true; do sleep .01;
done" caused free memory to start increasing, due to order-1
(task_struct allocation) watermarks waking kswapd while our other usual
VM activity was happening.

#!/bin/bash
# Clamp every SLUB cache to order-0 slabs where the object fits in a
# page, and to order-1 otherwise, via the SLUB sysfs interface.

for i in /sys/kernel/slab/*/; do
        if [ `cat $i/object_size` -le 4096 ]; then
                echo 0 > $i/order
        else
                echo 1 > $i/order
        fi
done

But this is on another machine, without Mel's patch, and with 8 GB
memory, so a bigger Normal zone.

[snip]

> If my interpretation and understanding is correct, this indicates, for
> example, a jump from ~3000 to ~10000 order-3 allocations resulting from
> 1 KiB kmalloc()s - meaning about 0.2 GiB (7000*4*8*1024/1024/1024). Add
> the other caches and we get some more, but only a few hundred MiB in total.
> 
> Going by the hypothesis that we are seeing the same thing as reported
> by Simon Kirby (I'll respond to that e-mail separately), the total
> amount is (as far as I understand) not the important part; but the
> non-trivial increase in order-3 allocations would be a consistent
> observation, in that frequent order-3 allocations might be more likely
> to trigger the behavior Simon reports.

Try installing the "perf" tool.  It can be built from the kernel tree in
tools/perf, and then you usually can just copy the binary around.  You
can use it to trace the points which cause kswapd to wake up, which will
show which processes are doing it, the order, flags, etc.

Just before the eviction is about to happen (or whenever), try this:

perf record --event vmscan:mm_vmscan_wakeup_kswapd --filter 'order>=3' \
	--call-graph -a sleep 30

Then view the recorded events with "perf trace", which should spit out
something like this:

    lmtp-3531  [003] 432339.243851: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
    lmtp-3531  [003] 432339.243856: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3

The process which woke kswapd may not be directly responsible for the
allocation, as a network interrupt or something could have happened on
top of it.  See "perf report", which is a bit dodgy at least for me, to
see the stack traces, which might make things clearer.  For example, my
traces show that even kswapd wakes kswapd sometimes, but it's because of
a trace like this:

    -      9.09%  kswapd0  [kernel.kallsyms]  [k] wakeup_kswapd
         wakeup_kswapd
         __alloc_pages_nodemask
         alloc_pages_current
         new_slab
         __slab_alloc
         __kmalloc_node_track_caller
         __alloc_skb
         __netdev_alloc_skb
         bnx2_poll_work
         bnx2_poll
         net_rx_action
         __do_softirq
         call_softirq
         do_softirq
         irq_exit
         do_IRQ
         ret_from_intr
         truncate_inode_pages
         proc_evict_inode
         evict
         iput
         dentry_iput
         d_kill
         __shrink_dcache_sb
         shrink_dcache_memory
         shrink_slab
         kswapd

Anyway, maybe you'll see some interesting traces.  If kswapd isn't waking
very often, you can also trace "kmem:mm_page_alloc" or similar (see "perf
list"), or try a smaller order or a longer sleep.

Cheers,

Simon-

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2010-12-01  9:15 UTC | newest]

Thread overview: 31+ messages
2010-11-12 16:20 Sudden and massive page cache eviction Peter Schüller
2010-11-23  0:11   ` Andrew Morton
2010-11-23  8:38     ` Dave Hansen
2010-11-23  9:44       ` Peter Schüller
2010-11-23 16:19         ` Dave Hansen
2010-11-24 14:02           ` Peter Schüller
2010-11-24 14:14             ` Peter Schüller
2010-11-24 14:20               ` Pekka Enberg
2010-11-24 15:32                 ` Peter Schüller
2010-11-24 17:46                   ` Pekka Enberg
2010-11-25  1:18                   ` Simon Kirby
2010-11-25 15:59                     ` Peter Schüller
2010-12-01  6:36                       ` Simon Kirby
2010-11-24 17:32               ` Dave Hansen
2010-11-25 15:33                 ` Peter Schüller
2010-12-01  9:15                   ` Simon Kirby
