All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michel Lespinasse <walken@google.com>
To: Andrew Morton <akpm@google.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Rik van Riel <riel@redhat.com>,
	Johannes Weiner <jweiner@redhat.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Hugh Dickins <hughd@google.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Michael Wolf <mjwolf@us.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 0/8] idle page tracking / working set estimation
Date: Thu, 22 Sep 2011 18:23:27 -0700	[thread overview]
Message-ID: <CANN689Gpn6hx0jXx1bzf_m_x9-ZQ4Uienfxcsyr=wV7ucZQXnQ@mail.gmail.com> (raw)
In-Reply-To: <20110922161325.f94f9c9e.akpm@google.com>

On Thu, Sep 22, 2011 at 4:13 PM, Andrew Morton <akpm@google.com> wrote:
> On Fri, 16 Sep 2011 20:39:05 -0700
> Michel Lespinasse <walken@google.com> wrote:
>
>> Please comment on the following patches (which are against the v3.0 kernel).
>> We are using these to collect memory utilization statistics for each cgroup
>> accross many machines, and optimize job placement accordingly.
>
> Please consider updating /proc/kpageflags with the three new page
> flags.  If "yes": update.  If "no": explain/justify.

The PG_stale flag should probably be exported that way. I'll make sure
to add this, thanks for the suggestion!

I am not sure about PG_young and PG_idle since they indicate young
bits have been cleared in PTE(s) pointing to the page since the last
page_referenced() call. This seems rather internal - we don't export
PTE young bits in kpageflags currently, nor do we export anything that
would depend on when page_referenced() was last called.

> Which prompts the obvious: the whole feature could have been mostly
> implemented in userspace, using kpageflags.  Some additional kernel
> support would presumably be needed, but I'm not sure how much.
>
> If you haven't already done so, please sketch down what that
> infrastructure would look like and have a think about which approach is
> preferable?

kpageflags does not currently do a page_referenced() call to export
PTE young flags. For a userspace approach, we would have to add that.
Also we would want to actually clear the PTE young bits so that the
page doesn't show up as young again on the next kpageflags read - and,
we wouldn't want to affect the normal LRU algorithms while doing this,
so we'd end up introducing the same PG_young and PG_idle flags. The
next issue would be to find out which cgroup an idle page belongs to -
this could be done by adding a new kpagecgroup file, I suppose. Given
the above, we'd have the necessary components for a userspace approach
- but, the only part that we would really be able to remove from the
kernel side is the loop that scans physical pages and tallies the idle
ones into a per-cgroup count.

> What bugs me a bit about the proposal is its cgroups-centricity.  The
> question "how much memory is my application really using" comes up
> again and again.  It predates cgroups.  One way to answer that question
> is to force a massive amount of swapout on the entire machine, then let
> the system recover and take a look at your app's RSS two minutes later.
> This is very lame.
>
> It's a legitimate requirement, and the kstaled infrastructure puts a
> lot of things in place to answer it well.  But as far as I can tell it
> doesn't quite get over the line.  Then again, maybe it _does_ get
> there: put the application into a memcg all of its own, just for
> instrumentation purposes and then use kstaled to monitor it?

Yes, this is what I would recomment in this situation - create a
memory cgroup to move the application in, and see what kstaled
reports.

> <later> OK, I'm surprised to discover that kstaled is doing a physical
> scan and not a virtual one.  I assume it works, but I don't know why.
> But it makes the above requirement harder, methinks.

The reason for the physical scan is that a virtual scan would have
some limitations:
- it would only report memory that's virtually mapped - we do want
file pages to be classified as idle or not, regardless of how the file
gets accessed
- it may not work well with jobs that involve short lived processes.

> How does all this code get along with hugepages, btw?

They should get along now that Andreas updated get_page and
get_page_unless_zero to avoid the race with THP tail page splitting.

However, you're reminding me that I forgot to include the patch that
would make the accounting correct when we encounter a THP page (we
want to report the entire page as idle rather than just the first 4K,
and increment pfn appropriately for the page size)...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

WARNING: multiple messages have this Message-ID (diff)
From: Michel Lespinasse <walken@google.com>
To: Andrew Morton <akpm@google.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Rik van Riel <riel@redhat.com>,
	Johannes Weiner <jweiner@redhat.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Hugh Dickins <hughd@google.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Michael Wolf <mjwolf@us.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 0/8] idle page tracking / working set estimation
Date: Thu, 22 Sep 2011 18:23:27 -0700	[thread overview]
Message-ID: <CANN689Gpn6hx0jXx1bzf_m_x9-ZQ4Uienfxcsyr=wV7ucZQXnQ@mail.gmail.com> (raw)
In-Reply-To: <20110922161325.f94f9c9e.akpm@google.com>

On Thu, Sep 22, 2011 at 4:13 PM, Andrew Morton <akpm@google.com> wrote:
> On Fri, 16 Sep 2011 20:39:05 -0700
> Michel Lespinasse <walken@google.com> wrote:
>
>> Please comment on the following patches (which are against the v3.0 kernel).
>> We are using these to collect memory utilization statistics for each cgroup
>> accross many machines, and optimize job placement accordingly.
>
> Please consider updating /proc/kpageflags with the three new page
> flags.  If "yes": update.  If "no": explain/justify.

The PG_stale flag should probably be exported that way. I'll make sure
to add this, thanks for the suggestion!

I am not sure about PG_young and PG_idle since they indicate young
bits have been cleared in PTE(s) pointing to the page since the last
page_referenced() call. This seems rather internal - we don't export
PTE young bits in kpageflags currently, nor do we export anything that
would depend on when page_referenced() was last called.

> Which prompts the obvious: the whole feature could have been mostly
> implemented in userspace, using kpageflags.  Some additional kernel
> support would presumably be needed, but I'm not sure how much.
>
> If you haven't already done so, please sketch down what that
> infrastructure would look like and have a think about which approach is
> preferable?

kpageflags does not currently do a page_referenced() call to export
PTE young flags. For a userspace approach, we would have to add that.
Also we would want to actually clear the PTE young bits so that the
page doesn't show up as young again on the next kpageflags read - and,
we wouldn't want to affect the normal LRU algorithms while doing this,
so we'd end up introducing the same PG_young and PG_idle flags. The
next issue would be to find out which cgroup an idle page belongs to -
this could be done by adding a new kpagecgroup file, I suppose. Given
the above, we'd have the necessary components for a userspace approach
- but, the only part that we would really be able to remove from the
kernel side is the loop that scans physical pages and tallies the idle
ones into a per-cgroup count.

> What bugs me a bit about the proposal is its cgroups-centricity.  The
> question "how much memory is my application really using" comes up
> again and again.  It predates cgroups.  One way to answer that question
> is to force a massive amount of swapout on the entire machine, then let
> the system recover and take a look at your app's RSS two minutes later.
> This is very lame.
>
> It's a legitimate requirement, and the kstaled infrastructure puts a
> lot of things in place to answer it well.  But as far as I can tell it
> doesn't quite get over the line.  Then again, maybe it _does_ get
> there: put the application into a memcg all of its own, just for
> instrumentation purposes and then use kstaled to monitor it?

Yes, this is what I would recomment in this situation - create a
memory cgroup to move the application in, and see what kstaled
reports.

> <later> OK, I'm surprised to discover that kstaled is doing a physical
> scan and not a virtual one.  I assume it works, but I don't know why.
> But it makes the above requirement harder, methinks.

The reason for the physical scan is that a virtual scan would have
some limitations:
- it would only report memory that's virtually mapped - we do want
file pages to be classified as idle or not, regardless of how the file
gets accessed
- it may not work well with jobs that involve short lived processes.

> How does all this code get along with hugepages, btw?

They should get along now that Andreas updated get_page and
get_page_unless_zero to avoid the race with THP tail page splitting.

However, you're reminding me that I forgot to include the patch that
would make the accounting correct when we encounter a THP page (we
want to report the entire page as idle rather than just the first 4K,
and increment pfn appropriately for the page size)...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2011-09-23  1:23 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-09-17  3:39 [PATCH 0/8] idle page tracking / working set estimation Michel Lespinasse
2011-09-17  3:39 ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 1/8] page_referenced: replace vm_flags parameter with struct pr_info Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-17  3:44   ` Joe Perches
2011-09-17  3:44     ` Joe Perches
2011-09-17  4:51     ` Michel Lespinasse
2011-09-17  4:51       ` Michel Lespinasse
2011-09-20 19:05   ` Rik van Riel
2011-09-20 19:05     ` Rik van Riel
2011-09-21  2:51     ` Michel Lespinasse
2011-09-21  2:51       ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 2/8] kstaled: documentation and config option Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-20 21:23   ` Rik van Riel
2011-09-20 21:23     ` Rik van Riel
2011-09-23 19:27   ` Rik van Riel
2011-09-23 19:27     ` Rik van Riel
2011-09-17  3:39 ` [PATCH 3/8] kstaled: page_referenced_kstaled() and supporting infrastructure Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-20 19:36   ` Peter Zijlstra
2011-09-20 19:36     ` Peter Zijlstra
2011-09-17  3:39 ` [PATCH 4/8] kstaled: minimalistic implementation Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-22 23:14   ` Andrew Morton
2011-09-22 23:14     ` Andrew Morton
2011-09-23  8:37     ` Michel Lespinasse
2011-09-23  8:37       ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 5/8] kstaled: skip non-RAM regions Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 6/8] kstaled: rate limit pages scanned per second Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-22 23:15   ` Andrew Morton
2011-09-22 23:15     ` Andrew Morton
2011-09-23 10:18     ` Michel Lespinasse
2011-09-23 10:18       ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 7/8] kstaled: add histogram sampling functionality Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-22 23:15   ` Andrew Morton
2011-09-22 23:15     ` Andrew Morton
2011-09-23 10:26     ` Michel Lespinasse
2011-09-23 10:26       ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 8/8] kstaled: add incrementally updating stale page count Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-22 23:13 ` [PATCH 0/8] idle page tracking / working set estimation Andrew Morton
2011-09-22 23:13   ` Andrew Morton
2011-09-23  1:23   ` Michel Lespinasse [this message]
2011-09-23  1:23     ` Michel Lespinasse
2011-09-27 10:03 ` Balbir Singh
2011-09-27 10:03   ` Balbir Singh
2011-09-27 10:14   ` Michel Lespinasse
2011-09-27 10:14     ` Michel Lespinasse
2011-09-27 16:50     ` Balbir Singh
2011-09-27 16:50       ` Balbir Singh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CANN689Gpn6hx0jXx1bzf_m_x9-ZQ4Uienfxcsyr=wV7ucZQXnQ@mail.gmail.com' \
    --to=walken@google.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=aarcange@redhat.com \
    --cc=akpm@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave@linux.vnet.ibm.com \
    --cc=hughd@google.com \
    --cc=jweiner@redhat.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mjwolf@us.ibm.com \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.