From: Michel Lespinasse <walken@google.com> To: Andrew Morton <akpm@google.com> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>, Dave Hansen <dave@linux.vnet.ibm.com>, Andrea Arcangeli <aarcange@redhat.com>, Rik van Riel <riel@redhat.com>, Johannes Weiner <jweiner@redhat.com>, KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>, Hugh Dickins <hughd@google.com>, Peter Zijlstra <a.p.zijlstra@chello.nl>, Michael Wolf <mjwolf@us.ibm.com>, Andrew Morton <akpm@linux-foundation.org> Subject: Re: [PATCH 0/8] idle page tracking / working set estimation Date: Thu, 22 Sep 2011 18:23:27 -0700 [thread overview] Message-ID: <CANN689Gpn6hx0jXx1bzf_m_x9-ZQ4Uienfxcsyr=wV7ucZQXnQ@mail.gmail.com> (raw) In-Reply-To: <20110922161325.f94f9c9e.akpm@google.com> On Thu, Sep 22, 2011 at 4:13 PM, Andrew Morton <akpm@google.com> wrote: > On Fri, 16 Sep 2011 20:39:05 -0700 > Michel Lespinasse <walken@google.com> wrote: > >> Please comment on the following patches (which are against the v3.0 kernel). >> We are using these to collect memory utilization statistics for each cgroup >> accross many machines, and optimize job placement accordingly. > > Please consider updating /proc/kpageflags with the three new page > flags. If "yes": update. If "no": explain/justify. The PG_stale flag should probably be exported that way. I'll make sure to add this, thanks for the suggestion! I am not sure about PG_young and PG_idle since they indicate young bits have been cleared in PTE(s) pointing to the page since the last page_referenced() call. This seems rather internal - we don't export PTE young bits in kpageflags currently, nor do we export anything that would depend on when page_referenced() was last called. > Which prompts the obvious: the whole feature could have been mostly > implemented in userspace, using kpageflags. Some additional kernel > support would presumably be needed, but I'm not sure how much. > > If you haven't already done so, please sketch down what that > infrastructure would look like and have a think about which approach is > preferable? kpageflags does not currently do a page_referenced() call to export PTE young flags. For a userspace approach, we would have to add that. Also we would want to actually clear the PTE young bits so that the page doesn't show up as young again on the next kpageflags read - and, we wouldn't want to affect the normal LRU algorithms while doing this, so we'd end up introducing the same PG_young and PG_idle flags. The next issue would be to find out which cgroup an idle page belongs to - this could be done by adding a new kpagecgroup file, I suppose. Given the above, we'd have the necessary components for a userspace approach - but, the only part that we would really be able to remove from the kernel side is the loop that scans physical pages and tallies the idle ones into a per-cgroup count. > What bugs me a bit about the proposal is its cgroups-centricity. The > question "how much memory is my application really using" comes up > again and again. It predates cgroups. One way to answer that question > is to force a massive amount of swapout on the entire machine, then let > the system recover and take a look at your app's RSS two minutes later. > This is very lame. > > It's a legitimate requirement, and the kstaled infrastructure puts a > lot of things in place to answer it well. But as far as I can tell it > doesn't quite get over the line. Then again, maybe it _does_ get > there: put the application into a memcg all of its own, just for > instrumentation purposes and then use kstaled to monitor it? Yes, this is what I would recomment in this situation - create a memory cgroup to move the application in, and see what kstaled reports. > <later> OK, I'm surprised to discover that kstaled is doing a physical > scan and not a virtual one. I assume it works, but I don't know why. > But it makes the above requirement harder, methinks. The reason for the physical scan is that a virtual scan would have some limitations: - it would only report memory that's virtually mapped - we do want file pages to be classified as idle or not, regardless of how the file gets accessed - it may not work well with jobs that involve short lived processes. > How does all this code get along with hugepages, btw? They should get along now that Andreas updated get_page and get_page_unless_zero to avoid the race with THP tail page splitting. However, you're reminding me that I forgot to include the patch that would make the accounting correct when we encounter a THP page (we want to report the entire page as idle rather than just the first 4K, and increment pfn appropriately for the page size)... -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies.
WARNING: multiple messages have this Message-ID (diff)
From: Michel Lespinasse <walken@google.com> To: Andrew Morton <akpm@google.com> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>, Dave Hansen <dave@linux.vnet.ibm.com>, Andrea Arcangeli <aarcange@redhat.com>, Rik van Riel <riel@redhat.com>, Johannes Weiner <jweiner@redhat.com>, KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>, Hugh Dickins <hughd@google.com>, Peter Zijlstra <a.p.zijlstra@chello.nl>, Michael Wolf <mjwolf@us.ibm.com>, Andrew Morton <akpm@linux-foundation.org> Subject: Re: [PATCH 0/8] idle page tracking / working set estimation Date: Thu, 22 Sep 2011 18:23:27 -0700 [thread overview] Message-ID: <CANN689Gpn6hx0jXx1bzf_m_x9-ZQ4Uienfxcsyr=wV7ucZQXnQ@mail.gmail.com> (raw) In-Reply-To: <20110922161325.f94f9c9e.akpm@google.com> On Thu, Sep 22, 2011 at 4:13 PM, Andrew Morton <akpm@google.com> wrote: > On Fri, 16 Sep 2011 20:39:05 -0700 > Michel Lespinasse <walken@google.com> wrote: > >> Please comment on the following patches (which are against the v3.0 kernel). >> We are using these to collect memory utilization statistics for each cgroup >> accross many machines, and optimize job placement accordingly. > > Please consider updating /proc/kpageflags with the three new page > flags. If "yes": update. If "no": explain/justify. The PG_stale flag should probably be exported that way. I'll make sure to add this, thanks for the suggestion! I am not sure about PG_young and PG_idle since they indicate young bits have been cleared in PTE(s) pointing to the page since the last page_referenced() call. This seems rather internal - we don't export PTE young bits in kpageflags currently, nor do we export anything that would depend on when page_referenced() was last called. > Which prompts the obvious: the whole feature could have been mostly > implemented in userspace, using kpageflags. Some additional kernel > support would presumably be needed, but I'm not sure how much. > > If you haven't already done so, please sketch down what that > infrastructure would look like and have a think about which approach is > preferable? kpageflags does not currently do a page_referenced() call to export PTE young flags. For a userspace approach, we would have to add that. Also we would want to actually clear the PTE young bits so that the page doesn't show up as young again on the next kpageflags read - and, we wouldn't want to affect the normal LRU algorithms while doing this, so we'd end up introducing the same PG_young and PG_idle flags. The next issue would be to find out which cgroup an idle page belongs to - this could be done by adding a new kpagecgroup file, I suppose. Given the above, we'd have the necessary components for a userspace approach - but, the only part that we would really be able to remove from the kernel side is the loop that scans physical pages and tallies the idle ones into a per-cgroup count. > What bugs me a bit about the proposal is its cgroups-centricity. The > question "how much memory is my application really using" comes up > again and again. It predates cgroups. One way to answer that question > is to force a massive amount of swapout on the entire machine, then let > the system recover and take a look at your app's RSS two minutes later. > This is very lame. > > It's a legitimate requirement, and the kstaled infrastructure puts a > lot of things in place to answer it well. But as far as I can tell it > doesn't quite get over the line. Then again, maybe it _does_ get > there: put the application into a memcg all of its own, just for > instrumentation purposes and then use kstaled to monitor it? Yes, this is what I would recomment in this situation - create a memory cgroup to move the application in, and see what kstaled reports. > <later> OK, I'm surprised to discover that kstaled is doing a physical > scan and not a virtual one. I assume it works, but I don't know why. > But it makes the above requirement harder, methinks. The reason for the physical scan is that a virtual scan would have some limitations: - it would only report memory that's virtually mapped - we do want file pages to be classified as idle or not, regardless of how the file gets accessed - it may not work well with jobs that involve short lived processes. > How does all this code get along with hugepages, btw? They should get along now that Andreas updated get_page and get_page_unless_zero to avoid the race with THP tail page splitting. However, you're reminding me that I forgot to include the patch that would make the accounting correct when we encounter a THP page (we want to report the entire page as idle rather than just the first 4K, and increment pfn appropriately for the page size)... -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2011-09-23 1:23 UTC|newest] Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top 2011-09-17 3:39 [PATCH 0/8] idle page tracking / working set estimation Michel Lespinasse 2011-09-17 3:39 ` Michel Lespinasse 2011-09-17 3:39 ` [PATCH 1/8] page_referenced: replace vm_flags parameter with struct pr_info Michel Lespinasse 2011-09-17 3:39 ` Michel Lespinasse 2011-09-17 3:44 ` Joe Perches 2011-09-17 3:44 ` Joe Perches 2011-09-17 4:51 ` Michel Lespinasse 2011-09-17 4:51 ` Michel Lespinasse 2011-09-20 19:05 ` Rik van Riel 2011-09-20 19:05 ` Rik van Riel 2011-09-21 2:51 ` Michel Lespinasse 2011-09-21 2:51 ` Michel Lespinasse 2011-09-17 3:39 ` [PATCH 2/8] kstaled: documentation and config option Michel Lespinasse 2011-09-17 3:39 ` Michel Lespinasse 2011-09-20 21:23 ` Rik van Riel 2011-09-20 21:23 ` Rik van Riel 2011-09-23 19:27 ` Rik van Riel 2011-09-23 19:27 ` Rik van Riel 2011-09-17 3:39 ` [PATCH 3/8] kstaled: page_referenced_kstaled() and supporting infrastructure Michel Lespinasse 2011-09-17 3:39 ` Michel Lespinasse 2011-09-20 19:36 ` Peter Zijlstra 2011-09-20 19:36 ` Peter Zijlstra 2011-09-17 3:39 ` [PATCH 4/8] kstaled: minimalistic implementation Michel Lespinasse 2011-09-17 3:39 ` Michel Lespinasse 2011-09-22 23:14 ` Andrew Morton 2011-09-22 23:14 ` Andrew Morton 2011-09-23 8:37 ` Michel Lespinasse 2011-09-23 8:37 ` Michel Lespinasse 2011-09-17 3:39 ` [PATCH 5/8] kstaled: skip non-RAM regions Michel Lespinasse 2011-09-17 3:39 ` Michel Lespinasse 2011-09-17 3:39 ` [PATCH 6/8] kstaled: rate limit pages scanned per second Michel Lespinasse 2011-09-17 3:39 ` Michel Lespinasse 2011-09-22 23:15 ` Andrew Morton 2011-09-22 23:15 ` Andrew Morton 2011-09-23 10:18 ` Michel Lespinasse 2011-09-23 10:18 ` Michel Lespinasse 2011-09-17 3:39 ` [PATCH 7/8] kstaled: add histogram sampling functionality Michel Lespinasse 2011-09-17 3:39 ` Michel Lespinasse 2011-09-22 23:15 ` Andrew Morton 2011-09-22 23:15 ` Andrew Morton 2011-09-23 10:26 ` Michel Lespinasse 2011-09-23 10:26 ` Michel Lespinasse 2011-09-17 3:39 ` [PATCH 8/8] kstaled: add incrementally updating stale page count Michel Lespinasse 2011-09-17 3:39 ` Michel Lespinasse 2011-09-22 23:13 ` [PATCH 0/8] idle page tracking / working set estimation Andrew Morton 2011-09-22 23:13 ` Andrew Morton 2011-09-23 1:23 ` Michel Lespinasse [this message] 2011-09-23 1:23 ` Michel Lespinasse 2011-09-27 10:03 ` Balbir Singh 2011-09-27 10:03 ` Balbir Singh 2011-09-27 10:14 ` Michel Lespinasse 2011-09-27 10:14 ` Michel Lespinasse 2011-09-27 16:50 ` Balbir Singh 2011-09-27 16:50 ` Balbir Singh
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to='CANN689Gpn6hx0jXx1bzf_m_x9-ZQ4Uienfxcsyr=wV7ucZQXnQ@mail.gmail.com' \ --to=walken@google.com \ --cc=a.p.zijlstra@chello.nl \ --cc=aarcange@redhat.com \ --cc=akpm@google.com \ --cc=akpm@linux-foundation.org \ --cc=dave@linux.vnet.ibm.com \ --cc=hughd@google.com \ --cc=jweiner@redhat.com \ --cc=kamezawa.hiroyu@jp.fujitsu.com \ --cc=kosaki.motohiro@jp.fujitsu.com \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=mjwolf@us.ibm.com \ --cc=riel@redhat.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.