* [PATCH 0/3] idle memory tracking
@ 2015-03-18 20:44 ` Vladimir Davydov
  0 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

Hi,

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently. Currently, the only means to estimate the amount of idle
memory provided by the kernel is /proc/PID/clear_refs. However, it has
two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

Back in 2011 an attempt was made by Michel Lespinasse to improve the
situation (see http://lwn.net/Articles/459269/). He proposed a kernel
space daemon which would periodically scan the physical address range,
testing and clearing ACCESS/YOUNG PTE bits, and counting pages that had
not been referenced since the last scan. The daemon avoided interference
with the page reclaimer by setting the new PG_young flag on referenced
pages and making page_referenced() return >= 1 if PG_young was set.

This patch set reuses the idea of Michel's patch set, but the
implementation is quite different. Instead of introducing a kernel space
daemon, it only provides userspace with the necessary means to
estimate the amount of idle memory, leaving the daemon to be implemented
in userspace. In order to achieve that, it adds two new proc files,
/proc/kpagecgroup and /proc/sys/vm/set_idle, and extends the clear_refs
interface.

Usage:

 1. Write 1 to /proc/sys/vm/set_idle.

    This will set the IDLE flag for all user pages. The IDLE flag is cleared
    when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
    to the page. It is also cleared when the page is freed.

 2. Wait some time.

 3. Write 6 to /proc/PID/clear_refs for each PID of interest.

    This will clear the IDLE flag for recently accessed pages.

 4. Count the number of idle pages as reported by /proc/kpageflags. One may use
    /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
    certain application/container.
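
Step 4 mentions /proc/PID/pagemap as a way to narrow the count down to a
single process. A minimal sketch of that lookup follows, assuming the entry
layout described in Documentation/vm/pagemap.txt (bit 63 = page present,
bits 0-54 = PFN). The helper names and the 4 KB default page size are
illustrative assumptions, not part of this patch set, and reading PFNs
typically requires root.

```python
import struct

PAGEMAP_ENTRY_SIZE = 8          # one 64-bit entry per virtual page
PM_PRESENT = 1 << 63            # set when the page is present in RAM
PM_PFN_MASK = (1 << 55) - 1     # bits 0-54 hold the PFN when present


def decode_pagemap_entry(raw):
    """Return the PFN stored in an 8-byte pagemap entry, or None."""
    (entry,) = struct.unpack("Q", raw)
    if not entry & PM_PRESENT:
        return None
    return entry & PM_PFN_MASK


def pfns_of_range(pid, start, end, page_size=4096):
    """Collect the PFNs backing virtual range [start, end) of a process."""
    pfns = set()
    with open("/proc/%d/pagemap" % pid, "rb") as f:
        # The file is indexed by virtual page number.
        f.seek(start // page_size * PAGEMAP_ENTRY_SIZE)
        for _ in range((end - start) // page_size):
            raw = f.read(PAGEMAP_ENTRY_SIZE)
            if len(raw) < PAGEMAP_ENTRY_SIZE:
                break
            pfn = decode_pagemap_entry(raw)
            if pfn is not None:
                pfns.add(pfn)
    return pfns
```

The resulting set can then be used to pick the matching entries out of
/proc/kpageflags, which is indexed by PFN.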

An example of using this new interface is below. It is a script that
counts the number of pages charged to a specified cgroup that have not
been accessed for a given time interval.

---- BEGIN SCRIPT ----
#! /usr/bin/python
#

import struct
import sys
import os
import stat
import time

def get_end_pfn():
    f = open("/proc/zoneinfo", "r")
    end_pfn = 0
    for l in f.readlines():
        l = l.split()
        if l[0] == "spanned":
            end_pfn = int(l[1])
        elif l[0] == "start_pfn:":
            end_pfn += int(l[1])
    return end_pfn

def set_idle():
    open("/proc/sys/vm/set_idle", "w").write("1")

def clear_refs(target_cg_path):
    procs = open(target_cg_path + "/cgroup.procs", "r")
    for pid in procs.readlines():
        try:
            with open("/proc/" + pid.rstrip() + "/clear_refs", "w") as f:
                f.write("6")
        except IOError as e:
            print("Failed to clear_refs for pid " + pid.rstrip() + ": " + str(e))

def count_idle(target_cg_path):
    target_cg_ino = os.stat(target_cg_path)[stat.ST_INO]

    pgflags = open("/proc/kpageflags", "rb")
    pgcgroup = open("/proc/kpagecgroup", "rb")

    nidle = 0

    for i in range(0, get_end_pfn()):
        cg_ino = struct.unpack('Q', pgcgroup.read(8))[0]
        flags = struct.unpack('Q', pgflags.read(8))[0]

        if cg_ino != target_cg_ino:
            continue

        # unevictable?
        if (flags >> 18) & 1 != 0:
            continue

        # huge?
        if (flags >> 22) & 1 != 0:
            npages = 512
        else:
            npages = 1

        # idle?
        if (flags >> 25) & 1 != 0:
            nidle += npages

    return nidle

if len(sys.argv) != 3:
    print("Usage: %s cgroup_path scan_interval" % sys.argv[0])
    sys.exit(1)

cg_path = sys.argv[1]
scan_interval = int(sys.argv[2])

while True:
    set_idle()
    time.sleep(scan_interval)
    clear_refs(cg_path)
    print(count_idle(cg_path))
---- END SCRIPT ----
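
The counting rule in count_idle() can also be written with named bit
positions and exercised on in-memory buffers instead of the live proc
files. This is only a sketch: the bit numbers are the ones the script
above uses, while has_flag(), count_idle_entries() and HPAGE_NR_PAGES are
names made up for illustration, and 512 assumes 2 MB huge pages built from
4 KB base pages.

```python
import struct

# Bit positions in a /proc/kpageflags entry, as used by the script above.
KPF_UNEVICTABLE = 18
KPF_THP = 22
KPF_IDLE = 25

HPAGE_NR_PAGES = 512   # assumption: 2 MB huge page / 4 KB base page


def has_flag(flags, bit):
    """True if the given bit is set in a kpageflags entry."""
    return (flags >> bit) & 1 == 1


def count_idle_entries(flag_bytes, cgroup_bytes, target_ino):
    """Count idle pages charged to target_ino.

    flag_bytes and cgroup_bytes are raw dumps of the same PFN range taken
    from /proc/kpageflags and /proc/kpagecgroup (8 bytes per PFN).
    """
    nidle = 0
    entries = zip(struct.iter_unpack("Q", flag_bytes),
                  struct.iter_unpack("Q", cgroup_bytes))
    for (flags,), (ino,) in entries:
        if ino != target_ino:
            continue
        if has_flag(flags, KPF_UNEVICTABLE):
            continue
        # A huge page entry stands for HPAGE_NR_PAGES base pages.
        npages = HPAGE_NR_PAGES if has_flag(flags, KPF_THP) else 1
        if has_flag(flags, KPF_IDLE):
            nidle += npages
    return nidle
```

Working on byte buffers keeps the rule testable without root and lets a
scanner read the proc files in large chunks rather than 8 bytes at a time.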

Thanks,

Vladimir Davydov (3):
  memcg: add page_cgroup_ino helper
  proc: add kpagecgroup file
  mm: idle memory tracking

 Documentation/filesystems/proc.txt     |    3 +
 Documentation/vm/00-INDEX              |    2 +
 Documentation/vm/idle_mem_tracking.txt |   23 +++++++
 Documentation/vm/pagemap.txt           |   10 ++-
 fs/proc/Kconfig                        |    5 +-
 fs/proc/page.c                         |  107 ++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c                     |   22 ++++++-
 include/linux/memcontrol.h             |    8 +--
 include/linux/page-flags.h             |   12 ++++
 include/uapi/linux/kernel-page-flags.h |    1 +
 kernel/sysctl.c                        |   14 +++++
 mm/Kconfig                             |   12 ++++
 mm/debug.c                             |    4 ++
 mm/hwpoison-inject.c                   |    5 +-
 mm/memcontrol.c                        |   61 ++++++++----------
 mm/memory-failure.c                    |   16 +----
 mm/rmap.c                              |    7 +++
 mm/swap.c                              |    2 +
 18 files changed, 248 insertions(+), 66 deletions(-)
 create mode 100644 Documentation/vm/idle_mem_tracking.txt

-- 
1.7.10.4


* [PATCH 1/3] memcg: add page_cgroup_ino helper
  2015-03-18 20:44 ` Vladimir Davydov
  (?)
@ 2015-03-18 20:44   ` Vladimir Davydov
  -1 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

Hwpoison allows filtering pages by memory cgroup ino. To achieve that, it
calls try_get_mem_cgroup_from_page(), then mem_cgroup_css(), and finally
cgroup_ino() on the cgroup returned. This looks bulky. Since in the next
patch I also need to get the ino of the memory cgroup a page is charged
to, in this patch I introduce the page_cgroup_ino() helper.

Note that page_cgroup_ino() only considers those pages that are charged
to mem_cgroup->memory (i.e. page->mem_cgroup != NULL), and for others it
returns 0, while try_get_mem_cgroup_from_page(), used by hwpoison before,
may extract the cgroup from a swapcache readahead page too. Ignoring
swapcache readahead pages allows calling page_cgroup_ino() on unlocked
pages, which is nice. Hwpoison users will hardly see any difference.

Another difference between try_get_mem_cgroup_from_page() and
page_cgroup_ino() is that the latter works for offline memory cgroups
while the former does not, which is crucial for the next patch.
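
The behavioural differences described above can be summarised with a small
userspace model. This is not kernel code: the Memcg and Page classes and
their fields are invented here to stand in for page->mem_cgroup, the swap
cgroup lookup and the css online state, and the two filter functions only
mirror the control flow of hwpoison_filter_task() before and after this
patch as I read it from the diff.

```python
from dataclasses import dataclass
from typing import Optional

EINVAL = 22


@dataclass
class Memcg:
    ino: int
    online: bool = True


@dataclass
class Page:
    mem_cgroup: Optional[Memcg] = None        # models page->mem_cgroup
    swapcache_memcg: Optional[Memcg] = None   # owner found via the swap entry


def old_lookup(page):
    """Model of try_get_mem_cgroup_from_page(): online cgroups only."""
    if page.mem_cgroup is not None:
        return page.mem_cgroup if page.mem_cgroup.online else None
    if page.swapcache_memcg is not None:
        memcg = page.swapcache_memcg
        return memcg if memcg.online else None
    return None


def old_filter(page, filter_ino):
    if not filter_ino:
        return 0
    memcg = old_lookup(page)
    if memcg is None or memcg.ino != filter_ino:
        return -EINVAL
    return 0


def page_cgroup_ino(page):
    """Model of the new helper: 0 unless the page is charged."""
    return page.mem_cgroup.ino if page.mem_cgroup is not None else 0


def new_filter(page, filter_ino):
    if not filter_ino:
        return 0
    return 0 if page_cgroup_ino(page) == filter_ino else -EINVAL
```

In this model a page charged to an offline cgroup passes the new filter
but not the old one, while a swapcache readahead page that is not yet
charged passes the old filter but not the new one.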

Since try_get_mem_cgroup_from_page() is not used by anyone else, this
patch removes this function. Also, it makes hwpoison memcg filter depend
on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP (I've no idea why it was made
dependent on CONFIG_MEMCG_SWAP initially).

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |    8 ++----
 mm/hwpoison-inject.c       |    5 +---
 mm/memcontrol.c            |   61 ++++++++++++++++++--------------------------
 mm/memory-failure.c        |   16 ++----------
 4 files changed, 30 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..9262a8407af7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -91,7 +91,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
 			      struct mem_cgroup *root);
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -192,6 +191,8 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
+unsigned long page_cgroup_ino(struct page *page);
+
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -252,11 +253,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	return NULL;
-}
-
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 329caf56df22..df63c3133d70 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
 	/*
 	 * do a racy check with elevated page count, to make sure PG_hwpoison
 	 * will only be set for the targeted owner (or on a free page).
-	 * We temporarily take page lock for try_get_mem_cgroup_from_page().
 	 * memory_failure() will redo the check reliably inside page lock.
 	 */
-	lock_page(hpage);
 	err = hwpoison_filter(hpage);
-	unlock_page(hpage);
 	if (err)
 		return 0;
 
@@ -123,7 +120,7 @@ static int pfn_inject_init(void)
 	if (!dentry)
 		goto fail;
 
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 	dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
 				    hwpoison_dir, &hwpoison_filter_memcg);
 	if (!dentry)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74a9641d8f9f..3ecbeda0a3f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2349,40 +2349,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	css_put_many(&memcg->css, nr_pages);
 }
 
-/*
- * try_get_mem_cgroup_from_page - look up page's memcg association
- * @page: the page
- *
- * Look up, get a css reference, and return the memcg that owns @page.
- *
- * The page must be locked to prevent racing with swap-in and page
- * cache charges.  If coming from an unlocked page table, the caller
- * must ensure the page is on the LRU or this can race with charging.
- */
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	struct mem_cgroup *memcg;
-	unsigned short id;
-	swp_entry_t ent;
-
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-
-	memcg = page->mem_cgroup;
-	if (memcg) {
-		if (!css_tryget_online(&memcg->css))
-			memcg = NULL;
-	} else if (PageSwapCache(page)) {
-		ent.val = page_private(page);
-		id = lookup_swap_cgroup_id(ent);
-		rcu_read_lock();
-		memcg = mem_cgroup_from_id(id);
-		if (memcg && !css_tryget_online(&memcg->css))
-			memcg = NULL;
-		rcu_read_unlock();
-	}
-	return memcg;
-}
-
 static void lock_page_lru(struct page *page, int *isolated)
 {
 	struct zone *zone = page_zone(page);
@@ -2774,6 +2740,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+unsigned long page_cgroup_ino(struct page *page)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ino = 0;
+
+	rcu_read_lock();
+	memcg = ACCESS_ONCE(page->mem_cgroup);
+	if (memcg)
+		ino = cgroup_ino(memcg->css.cgroup);
+	rcu_read_unlock();
+	return ino;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
 					 bool charge)
@@ -5482,8 +5461,18 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	if (do_swap_account && PageSwapCache(page))
-		memcg = try_get_mem_cgroup_from_page(page);
+	if (do_swap_account && PageSwapCache(page)) {
+		swp_entry_t ent = { .val = page_private(page), };
+		unsigned short id = lookup_swap_cgroup_id(ent);
+
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(id);
+		if (memcg && !css_tryget_online(&memcg->css))
+			memcg = NULL;
+		rcu_read_unlock();
+	}
 	if (!memcg)
 		memcg = get_mem_cgroup_from_mm(mm);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d487f8dc6d39..7414f24cefdf 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -128,27 +128,15 @@ static int hwpoison_filter_flags(struct page *p)
  * can only guarantee that the page either belongs to the memcg tasks, or is
  * a freed page.
  */
-#ifdef	CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 u64 hwpoison_filter_memcg;
 EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
 static int hwpoison_filter_task(struct page *p)
 {
-	struct mem_cgroup *mem;
-	struct cgroup_subsys_state *css;
-	unsigned long ino;
-
 	if (!hwpoison_filter_memcg)
 		return 0;
 
-	mem = try_get_mem_cgroup_from_page(p);
-	if (!mem)
-		return -EINVAL;
-
-	css = mem_cgroup_css(mem);
-	ino = cgroup_ino(css->cgroup);
-	css_put(css);
-
-	if (ino != hwpoison_filter_memcg)
+	if (page_cgroup_ino(p) != hwpoison_filter_memcg)
 		return -EINVAL;
 
 	return 0;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 1/3] memcg: add page_cgroup_ino helper
@ 2015-03-18 20:44   ` Vladimir Davydov
  0 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

Hwpoison allows to filter pages by memory cgroup ino. To ahieve that, it
calls try_get_mem_cgroup_from_page(), then mem_cgroup_css(), and finally
cgroup_ino() on the cgroup returned. This looks bulky. Since in the next
patch I need to get the ino of the memory cgroup a page is charged to
too, in this patch I introduce the page_cgroup_ino() helper.

Note that page_cgroup_ino() only considers those pages that are charged
to mem_cgroup->memory (i.e. page->mem_cgroup != NULL), and for others it
returns 0, while try_get_mem_cgroup_page(), used by hwpoison before, may
extract the cgroup from a swapcache readahead page too. Ignoring
swapcache readahead pages allows to call page_cgroup_ino() on unlocked
pages, which is nice. Hwpoison users will hardly see any difference.

Another difference between try_get_mem_cgroup_page() and
page_cgroup_ino() is that the latter works for offline memory cgroups
while the former does not, which is crucial for the next patch.

Since try_get_mem_cgroup_page() is not used by anyone else, this patch
removes this function. Also, it makes hwpoison memcg filter depend on
CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP (I've no idea why it was made
dependant on CONFIG_MEMCG_SWAP initially).

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |    8 ++----
 mm/hwpoison-inject.c       |    5 +---
 mm/memcontrol.c            |   61 ++++++++++++++++++--------------------------
 mm/memory-failure.c        |   16 ++----------
 4 files changed, 30 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..9262a8407af7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -91,7 +91,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
 			      struct mem_cgroup *root);
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -192,6 +191,8 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
+unsigned long page_cgroup_ino(struct page *page);
+
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -252,11 +253,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	return NULL;
-}
-
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 329caf56df22..df63c3133d70 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
 	/*
 	 * do a racy check with elevated page count, to make sure PG_hwpoison
 	 * will only be set for the targeted owner (or on a free page).
-	 * We temporarily take page lock for try_get_mem_cgroup_from_page().
 	 * memory_failure() will redo the check reliably inside page lock.
 	 */
-	lock_page(hpage);
 	err = hwpoison_filter(hpage);
-	unlock_page(hpage);
 	if (err)
 		return 0;
 
@@ -123,7 +120,7 @@ static int pfn_inject_init(void)
 	if (!dentry)
 		goto fail;
 
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 	dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
 				    hwpoison_dir, &hwpoison_filter_memcg);
 	if (!dentry)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74a9641d8f9f..3ecbeda0a3f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2349,40 +2349,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	css_put_many(&memcg->css, nr_pages);
 }
 
-/*
- * try_get_mem_cgroup_from_page - look up page's memcg association
- * @page: the page
- *
- * Look up, get a css reference, and return the memcg that owns @page.
- *
- * The page must be locked to prevent racing with swap-in and page
- * cache charges.  If coming from an unlocked page table, the caller
- * must ensure the page is on the LRU or this can race with charging.
- */
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	struct mem_cgroup *memcg;
-	unsigned short id;
-	swp_entry_t ent;
-
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-
-	memcg = page->mem_cgroup;
-	if (memcg) {
-		if (!css_tryget_online(&memcg->css))
-			memcg = NULL;
-	} else if (PageSwapCache(page)) {
-		ent.val = page_private(page);
-		id = lookup_swap_cgroup_id(ent);
-		rcu_read_lock();
-		memcg = mem_cgroup_from_id(id);
-		if (memcg && !css_tryget_online(&memcg->css))
-			memcg = NULL;
-		rcu_read_unlock();
-	}
-	return memcg;
-}
-
 static void lock_page_lru(struct page *page, int *isolated)
 {
 	struct zone *zone = page_zone(page);
@@ -2774,6 +2740,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+unsigned long page_cgroup_ino(struct page *page)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ino = 0;
+
+	rcu_read_lock();
+	memcg = ACCESS_ONCE(page->mem_cgroup);
+	if (memcg)
+		ino = cgroup_ino(memcg->css.cgroup);
+	rcu_read_unlock();
+	return ino;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
 					 bool charge)
@@ -5482,8 +5461,18 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	if (do_swap_account && PageSwapCache(page))
-		memcg = try_get_mem_cgroup_from_page(page);
+	if (do_swap_account && PageSwapCache(page)) {
+		swp_entry_t ent = { .val = page_private(page), };
+		unsigned short id = lookup_swap_cgroup_id(ent);
+
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(id);
+		if (memcg && !css_tryget_online(&memcg->css))
+			memcg = NULL;
+		rcu_read_unlock();
+	}
 	if (!memcg)
 		memcg = get_mem_cgroup_from_mm(mm);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d487f8dc6d39..7414f24cefdf 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -128,27 +128,15 @@ static int hwpoison_filter_flags(struct page *p)
  * can only guarantee that the page either belongs to the memcg tasks, or is
  * a freed page.
  */
-#ifdef	CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 u64 hwpoison_filter_memcg;
 EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
 static int hwpoison_filter_task(struct page *p)
 {
-	struct mem_cgroup *mem;
-	struct cgroup_subsys_state *css;
-	unsigned long ino;
-
 	if (!hwpoison_filter_memcg)
 		return 0;
 
-	mem = try_get_mem_cgroup_from_page(p);
-	if (!mem)
-		return -EINVAL;
-
-	css = mem_cgroup_css(mem);
-	ino = cgroup_ino(css->cgroup);
-	css_put(css);
-
-	if (ino != hwpoison_filter_memcg)
+	if (page_cgroup_ino(p) != hwpoison_filter_memcg)
 		return -EINVAL;
 
 	return 0;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 1/3] memcg: add page_cgroup_ino helper
@ 2015-03-18 20:44   ` Vladimir Davydov
  0 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

Hwpoison allows to filter pages by memory cgroup ino. To ahieve that, it
calls try_get_mem_cgroup_from_page(), then mem_cgroup_css(), and finally
cgroup_ino() on the cgroup returned. This looks bulky. Since in the next
patch I need to get the ino of the memory cgroup a page is charged to
too, in this patch I introduce the page_cgroup_ino() helper.

Note that page_cgroup_ino() only considers those pages that are charged
to mem_cgroup->memory (i.e. page->mem_cgroup != NULL), and for others it
returns 0, while try_get_mem_cgroup_page(), used by hwpoison before, may
extract the cgroup from a swapcache readahead page too. Ignoring
swapcache readahead pages allows to call page_cgroup_ino() on unlocked
pages, which is nice. Hwpoison users will hardly see any difference.

Another difference between try_get_mem_cgroup_page() and
page_cgroup_ino() is that the latter works for offline memory cgroups
while the former does not, which is crucial for the next patch.

Since try_get_mem_cgroup_page() is not used by anyone else, this patch
removes this function. Also, it makes hwpoison memcg filter depend on
CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP (I've no idea why it was made
dependant on CONFIG_MEMCG_SWAP initially).

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |    8 ++----
 mm/hwpoison-inject.c       |    5 +---
 mm/memcontrol.c            |   61 ++++++++++++++++++--------------------------
 mm/memory-failure.c        |   16 ++----------
 4 files changed, 30 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..9262a8407af7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -91,7 +91,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
 			      struct mem_cgroup *root);
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -192,6 +191,8 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
+unsigned long page_cgroup_ino(struct page *page);
+
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -252,11 +253,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	return NULL;
-}
-
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 329caf56df22..df63c3133d70 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
 	/*
 	 * do a racy check with elevated page count, to make sure PG_hwpoison
 	 * will only be set for the targeted owner (or on a free page).
-	 * We temporarily take page lock for try_get_mem_cgroup_from_page().
 	 * memory_failure() will redo the check reliably inside page lock.
 	 */
-	lock_page(hpage);
 	err = hwpoison_filter(hpage);
-	unlock_page(hpage);
 	if (err)
 		return 0;
 
@@ -123,7 +120,7 @@ static int pfn_inject_init(void)
 	if (!dentry)
 		goto fail;
 
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 	dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
 				    hwpoison_dir, &hwpoison_filter_memcg);
 	if (!dentry)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74a9641d8f9f..3ecbeda0a3f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2349,40 +2349,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	css_put_many(&memcg->css, nr_pages);
 }
 
-/*
- * try_get_mem_cgroup_from_page - look up page's memcg association
- * @page: the page
- *
- * Look up, get a css reference, and return the memcg that owns @page.
- *
- * The page must be locked to prevent racing with swap-in and page
- * cache charges.  If coming from an unlocked page table, the caller
- * must ensure the page is on the LRU or this can race with charging.
- */
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	struct mem_cgroup *memcg;
-	unsigned short id;
-	swp_entry_t ent;
-
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-
-	memcg = page->mem_cgroup;
-	if (memcg) {
-		if (!css_tryget_online(&memcg->css))
-			memcg = NULL;
-	} else if (PageSwapCache(page)) {
-		ent.val = page_private(page);
-		id = lookup_swap_cgroup_id(ent);
-		rcu_read_lock();
-		memcg = mem_cgroup_from_id(id);
-		if (memcg && !css_tryget_online(&memcg->css))
-			memcg = NULL;
-		rcu_read_unlock();
-	}
-	return memcg;
-}
-
 static void lock_page_lru(struct page *page, int *isolated)
 {
 	struct zone *zone = page_zone(page);
@@ -2774,6 +2740,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+unsigned long page_cgroup_ino(struct page *page)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ino = 0;
+
+	rcu_read_lock();
+	memcg = ACCESS_ONCE(page->mem_cgroup);
+	if (memcg)
+		ino = cgroup_ino(memcg->css.cgroup);
+	rcu_read_unlock();
+	return ino;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
 					 bool charge)
@@ -5482,8 +5461,18 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	if (do_swap_account && PageSwapCache(page))
-		memcg = try_get_mem_cgroup_from_page(page);
+	if (do_swap_account && PageSwapCache(page)) {
+		swp_entry_t ent = { .val = page_private(page), };
+		unsigned short id = lookup_swap_cgroup_id(ent);
+
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(id);
+		if (memcg && !css_tryget_online(&memcg->css))
+			memcg = NULL;
+		rcu_read_unlock();
+	}
 	if (!memcg)
 		memcg = get_mem_cgroup_from_mm(mm);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d487f8dc6d39..7414f24cefdf 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -128,27 +128,15 @@ static int hwpoison_filter_flags(struct page *p)
  * can only guarantee that the page either belongs to the memcg tasks, or is
  * a freed page.
  */
-#ifdef	CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 u64 hwpoison_filter_memcg;
 EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
 static int hwpoison_filter_task(struct page *p)
 {
-	struct mem_cgroup *mem;
-	struct cgroup_subsys_state *css;
-	unsigned long ino;
-
 	if (!hwpoison_filter_memcg)
 		return 0;
 
-	mem = try_get_mem_cgroup_from_page(p);
-	if (!mem)
-		return -EINVAL;
-
-	css = mem_cgroup_css(mem);
-	ino = cgroup_ino(css->cgroup);
-	css_put(css);
-
-	if (ino != hwpoison_filter_memcg)
+	if (page_cgroup_ino(p) != hwpoison_filter_memcg)
 		return -EINVAL;
 
 	return 0;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/3] proc: add kpagecgroup file
  2015-03-18 20:44 ` Vladimir Davydov
@ 2015-03-18 20:44   ` Vladimir Davydov
  -1 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
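
As a rough illustration of how userspace might consume this file, here is
a small sketch. It is not part of the patch: the helper names
kpage_offset() and read_kpage_entry() are made up for this example. Each
PFN maps to one 8-byte record, so the entry for a page lives at byte
offset PFN * 8. Offsets and sizes must be multiples of 8, since
kpagecgroup_read() returns -EINVAL otherwise.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Size of one record in /proc/kpagecgroup (one u64 per PFN). */
#define KPAGE_ENTRY_SIZE 8

static_assert(sizeof(uint64_t) == KPAGE_ENTRY_SIZE, "one u64 per PFN");

/* Byte offset of the record describing @pfn. Hypothetical helper. */
static off_t kpage_offset(uint64_t pfn)
{
	return (off_t)(pfn * KPAGE_ENTRY_SIZE);
}

/*
 * Read the record for @pfn from an already opened per-PFN file such as
 * /proc/kpagecgroup. Returns 0 on success, -1 on error or short read.
 * Hypothetical helper, not part of the patch.
 */
static int read_kpage_entry(int fd, uint64_t pfn, uint64_t *val)
{
	ssize_t n = pread(fd, val, sizeof(*val), kpage_offset(pfn));

	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}
```

The value read back is the cgroup inode number, or 0 if the page is not
charged to any memory cgroup or the PFN is invalid. It can be compared
against the st_ino of the cgroup directory of interest.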

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/vm/pagemap.txt |    6 ++++-
 fs/proc/Kconfig              |    5 ++--
 fs/proc/page.c               |   53 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6fbd55ef6b45..1ddfa1367b03 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
 userspace programs to examine the page tables and related information by
 reading files in /proc.
 
-There are three components to pagemap:
+There are four components to pagemap:
 
  * /proc/pid/pagemap.  This file lets a userspace process find out which
    physical frame each virtual page is mapped to.  It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
     23. BALLOON
     24. ZERO_PAGE
 
+ * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
+   memory cgroup each page is charged to, indexed by PFN. Only available when
+   CONFIG_MEMCG is set.
+
 Short descriptions to the page flags:
 
  0. LOCKED
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 2183fcf41d59..5021a2935bb9 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -69,5 +69,6 @@ config PROC_PAGE_MONITOR
  	help
 	  Various /proc files exist to monitor process memory utilization:
 	  /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
-	  /proc/kpagecount, and /proc/kpageflags. Disabling these
-          interfaces will reduce the size of the kernel by approximately 4kb.
+	  /proc/kpagecount, /proc/kpageflags, and /proc/kpagecgroup.
+	  Disabling these interfaces will reduce the size of the kernel
+	  by approximately 4kb.
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
 #include <linux/kernel-page-flags.h>
 #include <asm/uaccess.h>
 #include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
 	.read = kpageflags_read,
 };
 
+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	struct page *ppage;
+	unsigned long src = *ppos;
+	unsigned long pfn;
+	ssize_t ret = 0;
+	u64 ino;
+
+	pfn = src / KPMSIZE;
+	count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+	if (src & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	while (count > 0) {
+		if (pfn_valid(pfn))
+			ppage = pfn_to_page(pfn);
+		else
+			ppage = NULL;
+
+		if (ppage)
+			ino = page_cgroup_ino(ppage);
+		else
+			ino = 0;
+
+		if (put_user(ino, out)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		pfn++;
+		out++;
+		count -= KPMSIZE;
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+	.llseek = mem_lseek,
+	.read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
 	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+	proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
 	return 0;
 }
 fs_initcall(proc_page_init);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/3] mm: idle memory tracking
  2015-03-18 20:44 ` Vladimir Davydov
@ 2015-03-18 20:44   ` Vladimir Davydov
  -1 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently. Currently, the only means to estimate the amount of idle
memory provided by the kernel is /proc/PID/clear_refs. However, it has
two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

This patch attempts to provide userspace with the means to track
idle memory without the above-mentioned limitations.

Usage:

 1. Write 1 to /proc/sys/vm/set_idle.

    This will set the IDLE flag for all user pages. The IDLE flag is cleared
    when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
    to the page. It is also cleared when the page is freed.

 2. Wait some time.

 3. Write 6 to /proc/PID/clear_refs for each PID of interest.

    This will clear the IDLE flag for recently accessed pages.

 4. Count the number of idle pages as reported by /proc/kpageflags. One may use
    /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
    certain application/container.
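
The four steps above could be driven from userspace roughly as follows.
This is only a sketch under assumptions: it presumes a kernel with this
series applied and CONFIG_IDLE_MEM_TRACKING=y, and the helper names are
invented for illustration. The pagemap layout (bit 63 = present, bits
0-54 = PFN) is taken from Documentation/vm/pagemap.txt, and KPF_IDLE is
bit 25 as defined by this patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define KPF_IDLE	25			/* added by this patch */
#define PM_PRESENT	(1ULL << 63)		/* pagemap: page is in RAM */
#define PM_PFN_MASK	((1ULL << 55) - 1)	/* pagemap: bits 0-54 */

/* Steps 1 and 3: write a short string such as "1" or "6" to a proc file. */
static int write_str(const char *path, const char *s)
{
	size_t len = strlen(s);
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, s, len);
	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}

/* Extract the PFN from a /proc/PID/pagemap entry; returns 0 if not present. */
static int pagemap_pfn(uint64_t entry, uint64_t *pfn)
{
	assert(pfn != NULL);
	if (!(entry & PM_PRESENT))
		return 0;
	*pfn = entry & PM_PFN_MASK;
	return 1;
}

/* Step 4: count entries read from /proc/kpageflags that have IDLE set. */
static size_t count_idle(const uint64_t *flags, size_t n)
{
	size_t i, idle = 0;

	for (i = 0; i < n; i++)
		if (flags[i] & (1ULL << KPF_IDLE))
			idle++;
	return idle;
}
```

A monitor would call write_str("/proc/sys/vm/set_idle", "1"), sleep for
the sampling period, call write_str() with "6" on the clear_refs file of
each task of interest, and then pass the kpageflags words for the
relevant PFNs to count_idle(). Note that set_idle is created with mode
0200, so the writer needs root privileges.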

To avoid interference with the memory reclaimer, this patch adds the
PG_young flag in addition to PG_idle. The PG_young flag is set if the
ACCESS/YOUNG bit is cleared at step 3. page_referenced() returns >= 1 if
the page has the PG_young flag set.
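
The interplay of the two flags may be easier to follow in a toy model.
The code below is plain userspace C, not kernel code, and all names in it
are invented. It models a single page with a single mapping and mirrors
the logic of the clear_refs_pte_range() and page_referenced_one() hunks
of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one page with a single mapping; not kernel code. */
struct toy_page {
	bool pte_young;	/* hardware ACCESS/YOUNG bit in the PTE */
	bool pg_idle;	/* PG_idle */
	bool pg_young;	/* PG_young */
};

/* The task touches the page: hardware sets the PTE bit. */
static void toy_touch(struct toy_page *p)
{
	p->pte_young = true;
}

/* Step 1: write to /proc/sys/vm/set_idle. */
static void toy_set_idle(struct toy_page *p)
{
	p->pg_idle = true;
}

/* Step 3: clear_refs type 6. */
static void toy_clear_refs_idle(struct toy_page *p)
{
	if (p->pte_young) {
		p->pte_young = false;
		p->pg_idle = false;
		p->pg_young = true;	/* remember the reference for reclaim */
	}
}

/* Reclaimer side: returns the number of references it observed. */
static int toy_page_referenced(struct toy_page *p)
{
	int referenced = 0;

	assert(p != NULL);
	if (p->pte_young) {
		p->pte_young = false;
		p->pg_idle = false;
		referenced++;
	}
	if (p->pg_young) {
		p->pg_young = false;
		referenced++;
	}
	return referenced;
}
```

In this model, a page that was touched and then scanned with clear_refs
type 6 still reports one reference to the reclaimer, even though the PTE
bit is already gone. That is the sense in which PG_young keeps idle
tracking from hiding accesses from reclaim.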

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/filesystems/proc.txt     |    3 ++
 Documentation/vm/00-INDEX              |    2 ++
 Documentation/vm/idle_mem_tracking.txt |   23 ++++++++++++++
 Documentation/vm/pagemap.txt           |    4 +++
 fs/proc/page.c                         |   54 ++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c                     |   22 +++++++++++--
 include/linux/page-flags.h             |   12 +++++++
 include/uapi/linux/kernel-page-flags.h |    1 +
 kernel/sysctl.c                        |   14 +++++++++
 mm/Kconfig                             |   12 +++++++
 mm/debug.c                             |    4 +++
 mm/rmap.c                              |    7 +++++
 mm/swap.c                              |    2 ++
 13 files changed, 157 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/vm/idle_mem_tracking.txt

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 8e36c7e3c345..9880ddb0383f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -500,6 +500,9 @@ To reset the peak resident set size ("high water mark") to the process's
 current value:
     > echo 5 > /proc/PID/clear_refs
 
+To clear the idle bit (see Documentation/vm/idle_mem_tracking.txt):
+    > echo 6 > /proc/PID/clear_refs
+
 Any other value written to /proc/PID/clear_refs will have no effect.
 
 The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 081c49777abb..bab92cf7e2e4 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -14,6 +14,8 @@ hugetlbpage.txt
 	- a brief summary of hugetlbpage support in the Linux kernel.
 hwpoison.txt
 	- explains what hwpoison is
+idle_mem_tracking.txt
+	- explains how to track idle memory
 ksm.txt
 	- how to use the Kernel Samepage Merging feature.
 numa
diff --git a/Documentation/vm/idle_mem_tracking.txt b/Documentation/vm/idle_mem_tracking.txt
new file mode 100644
index 000000000000..4ca9bfafc560
--- /dev/null
+++ b/Documentation/vm/idle_mem_tracking.txt
@@ -0,0 +1,23 @@
+Idle memory tracking
+--------------------
+
+Knowing the portion of user memory that has not been touched for a given period
+of time can be useful to tune memory cgroup limits and/or for job placement
+within a compute cluster. CONFIG_IDLE_MEM_TRACKING provides the userspace with
+the means to estimate the amount of idle memory. In order to do this, one should:
+
+ 1. Write 1 to /proc/sys/vm/set_idle.
+
+    This will set the IDLE flag for all user pages. The IDLE flag is cleared
+    when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
+    to the page. It is also cleared when the page is freed.
+
+ 2. Wait some time.
+
+ 3. Write 6 to /proc/PID/clear_refs for each PID of interest.
+
+    This will clear the IDLE flag for recently accessed pages.
+
+ 4. Count the number of idle pages as reported by /proc/kpageflags. One may use
+    /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
+    certain application/container.
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 1ddfa1367b03..4202e1d57d8c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -64,6 +64,7 @@ There are four components to pagemap:
     22. THP
     23. BALLOON
     24. ZERO_PAGE
+    25. IDLE
 
  * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
    memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -114,6 +115,9 @@ Short descriptions to the page flags:
 24. ZERO_PAGE
     zero page for pfn_zero or huge_zero page
 
+25. IDLE
+    page is idle (see Documentation/vm/idle_mem_tracking.txt)
+
     [IO related page flags]
  1. ERROR     IO error occurred
  3. UPTODATE  page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..766478d66458 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -182,6 +182,10 @@ u64 stable_page_flags(struct page *page)
 	u |= kpf_copy_bit(k, KPF_OWNER_PRIVATE,	PG_owner_priv_1);
 	u |= kpf_copy_bit(k, KPF_ARCH,		PG_arch_1);
 
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	u |= kpf_copy_bit(k, KPF_IDLE,		PG_idle);
+#endif
+
 	return u;
 };
 
@@ -275,6 +279,56 @@ static const struct file_operations proc_kpagecgroup_operations = {
 };
 #endif /* CONFIG_MEMCG */
 
+#ifdef CONFIG_IDLE_MEM_TRACKING
+static void set_mem_idle_node(int nid)
+{
+	unsigned long start_pfn, end_pfn, pfn;
+	struct page *page;
+
+	start_pfn = node_start_pfn(nid);
+	end_pfn = node_end_pfn(nid);
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		if (need_resched())
+			cond_resched();
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (page_count(page) == 0 || !PageLRU(page))
+			continue;
+
+		if (unlikely(!get_page_unless_zero(page)))
+			continue;
+		if (unlikely(!PageLRU(page)))
+			goto next;
+
+		SetPageIdle(page);
+next:
+		put_page(page);
+	}
+}
+
+static void set_mem_idle(void)
+{
+	int nid;
+
+	for_each_online_node(nid)
+		set_mem_idle_node(nid);
+}
+
+int sysctl_set_mem_idle; /* unused */
+
+int sysctl_set_mem_idle_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		set_mem_idle();
+	return 0;
+}
+#endif /* CONFIG_IDLE_MEM_TRACKING */
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 956b75d61809..b2b5ed1e10bb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
-	if (young || PageReferenced(page))
+	if (young || PageYoung(page) || PageReferenced(page))
 		mss->referenced += size;
 	mapcount = page_mapcount(page);
 	if (mapcount >= 2) {
@@ -733,6 +733,7 @@ enum clear_refs_types {
 	CLEAR_REFS_MAPPED,
 	CLEAR_REFS_SOFT_DIRTY,
 	CLEAR_REFS_MM_HIWATER_RSS,
+	CLEAR_REFS_IDLE,
 	CLEAR_REFS_LAST,
 };
 
@@ -806,6 +807,14 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 
 		page = pmd_page(*pmd);
 
+		if (cp->type == CLEAR_REFS_IDLE) {
+			if (pmdp_test_and_clear_young(vma, addr, pmd)) {
+				ClearPageIdle(page);
+				SetPageYoung(page);
+			}
+			goto out;
+		}
+
 		/* Clear accessed and referenced bits. */
 		pmdp_test_and_clear_young(vma, addr, pmd);
 		ClearPageReferenced(page);
@@ -833,6 +842,14 @@ out:
 		if (!page)
 			continue;
 
+		if (cp->type == CLEAR_REFS_IDLE) {
+			if (ptep_test_and_clear_young(vma, addr, pte)) {
+				ClearPageIdle(page);
+				SetPageYoung(page);
+			}
+			continue;
+		}
+
 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
 		ClearPageReferenced(page);
@@ -852,10 +869,9 @@ static int clear_refs_test_walk(unsigned long start, unsigned long end,
 		return 1;
 
 	/*
-	 * Writing 1 to /proc/pid/clear_refs affects all pages.
+	 * Writing 1, 4, or 6 to /proc/pid/clear_refs affects all pages.
 	 * Writing 2 to /proc/pid/clear_refs only affects anonymous pages.
 	 * Writing 3 to /proc/pid/clear_refs only affects file mapped pages.
-	 * Writing 4 to /proc/pid/clear_refs affects all pages.
 	 */
 	if (cp->type == CLEAR_REFS_ANON && vma->vm_file)
 		return 1;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c851ff92d5b3..8e06d11eb723 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	PG_young,
+	PG_idle,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -289,6 +293,14 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+#ifdef CONFIG_IDLE_MEM_TRACKING
+PAGEFLAG(Young, young)
+PAGEFLAG(Idle, idle)
+#else
+PAGEFLAG_FALSE(Young)
+PAGEFLAG_FALSE(Idle)
+#endif
+
 u64 stable_page_flags(struct page *page);
 
 static inline int PageUptodate(struct page *page)
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
 #define KPF_THP			22
 #define KPF_BALLOON		23
 #define KPF_ZERO_PAGE		24
+#define KPF_IDLE		25
 
 
 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c1552633e159..54b9a0aa290f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -112,6 +112,11 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+#ifdef CONFIG_IDLE_MEM_TRACKING
+extern int sysctl_set_mem_idle;
+extern int sysctl_set_mem_idle_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos);
+#endif
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -1512,6 +1517,15 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	{
+		.procname	= "set_idle",
+		.data		= &sysctl_set_mem_idle,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_set_mem_idle_handler,
+	},
+#endif
 	{ }
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 390214da4546..c6d5e931f62c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -635,3 +635,15 @@ config MAX_STACK_SIZE_MB
 	  changed to a smaller value in which case that is used.
 
 	  A sane initial value is 80 MB.
+
+config IDLE_MEM_TRACKING
+	bool "Enable idle memory tracking"
+	depends on 64BIT
+	select PROC_PAGE_MONITOR
+	help
+	  This feature allows a userspace process to estimate the size of user
+	  memory that has not been touched during a given period of time. This
+	  information can be useful to tune memory cgroup limits and/or for job
+	  placement within a compute cluster.
+
+	  See Documentation/vm/idle_mem_tracking.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..88468485a1f3 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	{1UL << PG_young,		"young"		},
+	{1UL << PG_idle,		"idle"		},
+#endif
 };
 
 static void dump_flags(unsigned long flags,
diff --git a/mm/rmap.c b/mm/rmap.c
index 8030382bbf5f..1afcc4db31e0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -799,6 +799,13 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	if (referenced) {
 		pra->referenced++;
 		pra->vm_flags |= vma->vm_flags;
+		if (PageIdle(page))
+			ClearPageIdle(page);
+	}
+
+	if (PageYoung(page)) {
+		ClearPageYoung(page);
+		pra->referenced++;
 	}
 
 	if (dirty)
diff --git a/mm/swap.c b/mm/swap.c
index cd3a5e64cea9..94e591ecd64b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -615,6 +615,8 @@ void mark_page_accessed(struct page *page)
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
+	if (PageIdle(page))
+		ClearPageIdle(page);
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/3] mm: idle memory tracking
@ 2015-03-18 20:44   ` Vladimir Davydov
  0 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently. Currently, the only means to estimate the amount of idle
memory provided by the kernel is /proc/PID/clear_refs. However, it has
two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

This patch attempts to provide userspace with the means to track
idle memory without the above-mentioned limitations.

Usage:

 1. Write 1 to /proc/sys/vm/set_idle.

    This will set the IDLE flag for all user pages. The IDLE flag is cleared
    when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
    to the page. It is also cleared when the page is freed.

 2. Wait some time.

 3. Write 6 to /proc/PID/clear_refs for each PID of interest.

    This will clear the IDLE flag for recently accessed pages.

 4. Count the number of idle pages as reported by /proc/kpageflags. One may use
    /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
    certain application/container.

To avoid interference with the memory reclaimer, this patch adds the
PG_young flag in addition to PG_idle. The PG_young flag is set if the
ACCESS/YOUNG bit is cleared at step 3. page_referenced() returns >= 1 if
the page has the PG_young flag set.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/filesystems/proc.txt     |    3 ++
 Documentation/vm/00-INDEX              |    2 ++
 Documentation/vm/idle_mem_tracking.txt |   23 ++++++++++++++
 Documentation/vm/pagemap.txt           |    4 +++
 fs/proc/page.c                         |   54 ++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c                     |   22 +++++++++++--
 include/linux/page-flags.h             |   12 +++++++
 include/uapi/linux/kernel-page-flags.h |    1 +
 kernel/sysctl.c                        |   14 +++++++++
 mm/Kconfig                             |   12 +++++++
 mm/debug.c                             |    4 +++
 mm/rmap.c                              |    7 +++++
 mm/swap.c                              |    2 ++
 13 files changed, 157 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/vm/idle_mem_tracking.txt

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 8e36c7e3c345..9880ddb0383f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -500,6 +500,9 @@ To reset the peak resident set size ("high water mark") to the process's
 current value:
     > echo 5 > /proc/PID/clear_refs
 
+To clear the idle bit (see Documentation/vm/idle_mem_tracking.txt):
+    > echo 6 > /proc/PID/clear_refs
+
 Any other value written to /proc/PID/clear_refs will have no effect.
 
 The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 081c49777abb..bab92cf7e2e4 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -14,6 +14,8 @@ hugetlbpage.txt
 	- a brief summary of hugetlbpage support in the Linux kernel.
 hwpoison.txt
 	- explains what hwpoison is
+idle_mem_tracking.txt
+	- explains how to track idle memory
 ksm.txt
 	- how to use the Kernel Samepage Merging feature.
 numa
diff --git a/Documentation/vm/idle_mem_tracking.txt b/Documentation/vm/idle_mem_tracking.txt
new file mode 100644
index 000000000000..4ca9bfafc560
--- /dev/null
+++ b/Documentation/vm/idle_mem_tracking.txt
@@ -0,0 +1,23 @@
+Idle memory tracking
+--------------------
+
+Knowing the portion of user memory that has not been touched for a given period
+of time can be useful to tune memory cgroup limits and/or for job placement
+within a compute cluster. CONFIG_IDLE_MEM_TRACKING provides the userspace with
+the means to estimate the amount of idle memory. In order to do this, one should:
+
+ 1. Write 1 to /proc/sys/vm/set_idle.
+
+    This will set the IDLE flag for all user pages. The IDLE flag is cleared
+    when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
+    to the page. It is also cleared when the page is freed.
+
+ 2. Wait some time.
+
+ 3. Write 6 to /proc/PID/clear_refs for each PID of interest.
+
+    This will clear the IDLE flag for recently accessed pages.
+
+ 4. Count the number of idle pages as reported by /proc/kpageflags. One may use
+    /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
+    certain application/container.
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 1ddfa1367b03..4202e1d57d8c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -64,6 +64,7 @@ There are four components to pagemap:
     22. THP
     23. BALLOON
     24. ZERO_PAGE
+    25. IDLE
 
  * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
    memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -114,6 +115,9 @@ Short descriptions to the page flags:
 24. ZERO_PAGE
     zero page for pfn_zero or huge_zero page
 
+25. IDLE
+    page is idle (see Documentation/vm/idle_mem_tracking.txt)
+
     [IO related page flags]
  1. ERROR     IO error occurred
  3. UPTODATE  page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..766478d66458 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -182,6 +182,10 @@ u64 stable_page_flags(struct page *page)
 	u |= kpf_copy_bit(k, KPF_OWNER_PRIVATE,	PG_owner_priv_1);
 	u |= kpf_copy_bit(k, KPF_ARCH,		PG_arch_1);
 
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	u |= kpf_copy_bit(k, KPF_IDLE,		PG_idle);
+#endif
+
 	return u;
 };
 
@@ -275,6 +279,56 @@ static const struct file_operations proc_kpagecgroup_operations = {
 };
 #endif /* CONFIG_MEMCG */
 
+#ifdef CONFIG_IDLE_MEM_TRACKING
+static void set_mem_idle_node(int nid)
+{
+	unsigned long start_pfn, end_pfn, pfn;
+	struct page *page;
+
+	start_pfn = node_start_pfn(nid);
+	end_pfn = node_end_pfn(nid);
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		if (need_resched())
+			cond_resched();
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (page_count(page) == 0 || !PageLRU(page))
+			continue;
+
+		if (unlikely(!get_page_unless_zero(page)))
+			continue;
+		if (unlikely(!PageLRU(page)))
+			goto next;
+
+		SetPageIdle(page);
+next:
+		put_page(page);
+	}
+}
+
+static void set_mem_idle(void)
+{
+	int nid;
+
+	for_each_online_node(nid)
+		set_mem_idle_node(nid);
+}
+
+int sysctl_set_mem_idle; /* unused */
+
+int sysctl_set_mem_idle_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		set_mem_idle();
+	return 0;
+}
+#endif /* CONFIG_IDLE_MEM_TRACKING */
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 956b75d61809..b2b5ed1e10bb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
-	if (young || PageReferenced(page))
+	if (young || PageYoung(page) || PageReferenced(page))
 		mss->referenced += size;
 	mapcount = page_mapcount(page);
 	if (mapcount >= 2) {
@@ -733,6 +733,7 @@ enum clear_refs_types {
 	CLEAR_REFS_MAPPED,
 	CLEAR_REFS_SOFT_DIRTY,
 	CLEAR_REFS_MM_HIWATER_RSS,
+	CLEAR_REFS_IDLE,
 	CLEAR_REFS_LAST,
 };
 
@@ -806,6 +807,14 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 
 		page = pmd_page(*pmd);
 
+		if (cp->type == CLEAR_REFS_IDLE) {
+			if (pmdp_test_and_clear_young(vma, addr, pmd)) {
+				ClearPageIdle(page);
+				SetPageYoung(page);
+			}
+			goto out;
+		}
+
 		/* Clear accessed and referenced bits. */
 		pmdp_test_and_clear_young(vma, addr, pmd);
 		ClearPageReferenced(page);
@@ -833,6 +842,14 @@ out:
 		if (!page)
 			continue;
 
+		if (cp->type == CLEAR_REFS_IDLE) {
+			if (ptep_test_and_clear_young(vma, addr, pte)) {
+				ClearPageIdle(page);
+				SetPageYoung(page);
+			}
+			continue;
+		}
+
 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
 		ClearPageReferenced(page);
@@ -852,10 +869,9 @@ static int clear_refs_test_walk(unsigned long start, unsigned long end,
 		return 1;
 
 	/*
-	 * Writing 1 to /proc/pid/clear_refs affects all pages.
+	 * Writing 1, 4, or 6 to /proc/pid/clear_refs affects all pages.
 	 * Writing 2 to /proc/pid/clear_refs only affects anonymous pages.
 	 * Writing 3 to /proc/pid/clear_refs only affects file mapped pages.
-	 * Writing 4 to /proc/pid/clear_refs affects all pages.
 	 */
 	if (cp->type == CLEAR_REFS_ANON && vma->vm_file)
 		return 1;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c851ff92d5b3..8e06d11eb723 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	PG_young,
+	PG_idle,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -289,6 +293,14 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+#ifdef CONFIG_IDLE_MEM_TRACKING
+PAGEFLAG(Young, young)
+PAGEFLAG(Idle, idle)
+#else
+PAGEFLAG_FALSE(Young)
+PAGEFLAG_FALSE(Idle)
+#endif
+
 u64 stable_page_flags(struct page *page);
 
 static inline int PageUptodate(struct page *page)
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
 #define KPF_THP			22
 #define KPF_BALLOON		23
 #define KPF_ZERO_PAGE		24
+#define KPF_IDLE		25
 
 
 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c1552633e159..54b9a0aa290f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -112,6 +112,11 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+#ifdef CONFIG_IDLE_MEM_TRACKING
+extern int sysctl_set_mem_idle;
+extern int sysctl_set_mem_idle_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos);
+#endif
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -1512,6 +1517,15 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	{
+		.procname	= "set_idle",
+		.data		= &sysctl_set_mem_idle,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_set_mem_idle_handler,
+	},
+#endif
 	{ }
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 390214da4546..c6d5e931f62c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -635,3 +635,15 @@ config MAX_STACK_SIZE_MB
 	  changed to a smaller value in which case that is used.
 
 	  A sane initial value is 80 MB.
+
+config IDLE_MEM_TRACKING
+	bool "Enable idle memory tracking"
+	depends on 64BIT
+	select PROC_PAGE_MONITOR
+	help
+	  This feature allows a userspace process to estimate the size of user
+	  memory that has not been touched during a given period of time. This
+	  information can be useful to tune memory cgroup limits and/or for job
+	  placement within a compute cluster.
+
+	  See Documentation/vm/idle_mem_tracking.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..88468485a1f3 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	{1UL << PG_young,		"young"		},
+	{1UL << PG_idle,		"idle"		},
+#endif
 };
 
 static void dump_flags(unsigned long flags,
diff --git a/mm/rmap.c b/mm/rmap.c
index 8030382bbf5f..1afcc4db31e0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -799,6 +799,13 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	if (referenced) {
 		pra->referenced++;
 		pra->vm_flags |= vma->vm_flags;
+		if (PageIdle(page))
+			ClearPageIdle(page);
+	}
+
+	if (PageYoung(page)) {
+		ClearPageYoung(page);
+		pra->referenced++;
 	}
 
 	if (dirty)
diff --git a/mm/swap.c b/mm/swap.c
index cd3a5e64cea9..94e591ecd64b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -615,6 +615,8 @@ void mark_page_accessed(struct page *page)
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
+	if (PageIdle(page))
+		ClearPageIdle(page);
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
-- 
1.7.10.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] idle memory tracking
  2015-03-18 20:44 ` Vladimir Davydov
@ 2015-03-19  2:13   ` Minchan Kim
  -1 siblings, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2015-03-19  2:13 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Greg Thelen,
	Michel Lespinasse, David Rientjes, Pavel Emelyanov,
	Cyrill Gorcunov, Jonathan Corbet, linux-api, linux-doc, linux-mm,
	linux-kernel

Hello,

On Wed, Mar 18, 2015 at 11:44:33PM +0300, Vladimir Davydov wrote:
> Hi,
> 
> Knowing the portion of memory that is not used by a certain application
> or memory cgroup (idle memory) can be useful for partitioning the system
> efficiently. Currently, the only means to estimate the amount of idle
> memory provided by the kernel is /proc/PID/clear_refs. However, it has
> two serious shortcomings:
> 
>  - it does not count unmapped file pages
>  - it affects the reclaimer logic
> 
> Back in 2011 an attempt was made by Michel Lespinasse to improve the
> situation (see http://lwn.net/Articles/459269/). He proposed a kernel
> space daemon which would periodically scan physical address range,
> testing and clearing ACCESS/YOUNG PTE bits, and counting pages that had
> not been referenced since the last scan. The daemon avoided interference
> with the page reclaimer by setting the new PG_young flag on referenced
> pages and making page_referenced() return >= 1 if PG_young was set.
> 
> This patch set reuses the idea of Michel's patch set, but the
> implementation is quite different. Instead of introducing a kernel space
> daemon, it only provides the userspace with the necessary means to
> estimate the amount of idle memory, leaving the daemon to be implemented
> in the userspace. In order to achieve that, it adds two new proc files,
> /proc/kpagecgroup and /proc/sys/vm/set_idle, and extends the clear_refs
> interface.
> 
> 
>  1. Write 1 to /proc/sys/vm/set_idle.
> 
>     This will set the IDLE flag for all user pages. The IDLE flag is cleared
>     when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
>     to the page. It is also cleared when the page is freed.

So we have to scan all pages periodically? I understand why you did it, but
some users might not care about unmapped pages, so I hope this can be optional.
If someone just wants to track mapped file+anon pages, he can do it
by scanning the address space of the process he selects.
Someone might even want to scan just a part of the address space rather than
the whole address space of the process. Actually, I have such a scenario.

> 
>  2. Wait some time.
> 
>  3. Write 6 to /proc/PID/clear_refs for each PID of interest.
> 
>     This will clear the IDLE flag for recently accessed pages.
> 
>  4. Count the number of idle pages as reported by /proc/kpageflags. One may use
>     /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
>     certain application/container.
> 

Adding two new page flags? That may be okay for 64 bit, but there is no
room on 32 bit. Please take care of 32 bit: it would be a good feature for
embedded systems. How about using page_ext if you can't make room in
page->flags on 32 bit? You could add the per-page metadata there.

Your suggestion is generic, so my concern is overhead. On every iteration
we have to set, clear, and inspect page flags. I don't know how much overhead
that adds, but it could surely be big if memory is big.
Couldn't we do it all in one go? Maybe with something like mincore:

        int idlecore(pid_t pid, void *addr, size_t length, unsigned char *vec)

Then vec would tell us which pages of the process [pid] were idle in the
range [addr, addr + length), and the same system call would reset the idle
state of those pages for the process in one go.
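
To illustrate the idea: assuming such a call filled vec the way mincore()
does, with one byte per page and the state in the least significant bit
(here "idle" instead of "resident"), userspace could post-process the vector
roughly as sketched below. idlecore() is only a proposal and does not exist;
the helper names, the bit convention, and the 4 KiB page size are all
assumptions made for this sketch.

```python
def count_idle_in_vec(vec):
    """Count pages flagged idle in a mincore-style vector.

    Assumption: one byte per page, idle state in the least significant
    bit (mirroring how mincore() reports residency).
    """
    return sum(1 for b in bytearray(vec) if b & 1)


def idle_ranges(vec, addr, page_size=4096):
    """Yield (start, end) address ranges of consecutive idle pages."""
    start = None
    for i, b in enumerate(bytearray(vec)):
        if b & 1:
            if start is None:
                start = addr + i * page_size
        elif start is not None:
            yield (start, addr + i * page_size)
            start = None
    if start is not None:
        # The vector ended inside an idle run; close it at the range end.
        yield (start, addr + len(vec) * page_size)
```

Merging consecutive idle pages into ranges keeps the result small when a
large part of the address space is untouched.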

Anyway, thanks for the good feature.

> An example of using this new interface is below. It is a script that
> counts the number of pages charged to a specified cgroup that have not
> been accessed for a given time interval.
> 
> ---- BEGIN SCRIPT ----
> #! /usr/bin/python
> #
> 
> import struct
> import sys
> import os
> import stat
> import time
> 
> def get_end_pfn():
>     f = open("/proc/zoneinfo", "r")
>     end_pfn = 0
>     for l in f.readlines():
>         l = l.split()
>         if l[0] == "spanned":
>             end_pfn = int(l[1])
>         elif l[0] == "start_pfn:":
>             end_pfn += int(l[1])
>     return end_pfn
> 
> def set_idle():
>     open("/proc/sys/vm/set_idle", "w").write("1")
> 
> def clear_refs(target_cg_path):
>     procs = open(target_cg_path + "/cgroup.procs", "r")
>     for pid in procs.readlines():
>         try:
>             with open("/proc/" + pid.rstrip() + "/clear_refs", "w") as f:
>                 f.write("6")
>         except IOError as e:
>             print "Failed to clear_refs for pid " + pid + ": " + str(e)
> 
> def count_idle(target_cg_path):
>     target_cg_ino = os.stat(target_cg_path)[stat.ST_INO]
> 
>     pgflags = open("/proc/kpageflags", "rb")
>     pgcgroup = open("/proc/kpagecgroup", "rb")
> 
>     nidle = 0
> 
>     for i in range(0, get_end_pfn()):
>         cg_ino = struct.unpack('Q', pgcgroup.read(8))[0]
>         flags = struct.unpack('Q', pgflags.read(8))[0]
> 
>         if cg_ino != target_cg_ino:
>             continue
> 
>         # unevictable?
>         if (flags >> 18) & 1 != 0:
>             continue
> 
>         # huge?
>         if (flags >> 22) & 1 != 0:
>             npages = 512
>         else:
>             npages = 1
> 
>         # idle?
>         if (flags >> 25) & 1 != 0:
>             nidle += npages
> 
>     return nidle
> 
> if len(sys.argv) <> 3:
>     print "Usage: %s cgroup_path scan_interval" % sys.argv[0]
>     exit(1)
> 
> cg_path = sys.argv[1]
> scan_interval = int(sys.argv[2])
> 
> while True:
>     set_idle()
>     time.sleep(scan_interval)
>     clear_refs(cg_path)
>     print count_idle(cg_path)
> ---- END SCRIPT ----
> 
> Thanks,
> 
> Vladimir Davydov (3):
>   memcg: add page_cgroup_ino helper
>   proc: add kpagecgroup file
>   mm: idle memory tracking
> 
>  Documentation/filesystems/proc.txt     |    3 +
>  Documentation/vm/00-INDEX              |    2 +
>  Documentation/vm/idle_mem_tracking.txt |   23 +++++++
>  Documentation/vm/pagemap.txt           |   10 ++-
>  fs/proc/Kconfig                        |    5 +-
>  fs/proc/page.c                         |  107 ++++++++++++++++++++++++++++++++
>  fs/proc/task_mmu.c                     |   22 ++++++-
>  include/linux/memcontrol.h             |    8 +--
>  include/linux/page-flags.h             |   12 ++++
>  include/uapi/linux/kernel-page-flags.h |    1 +
>  kernel/sysctl.c                        |   14 +++++
>  mm/Kconfig                             |   12 ++++
>  mm/debug.c                             |    4 ++
>  mm/hwpoison-inject.c                   |    5 +-
>  mm/memcontrol.c                        |   61 ++++++++----------
>  mm/memory-failure.c                    |   16 +----
>  mm/rmap.c                              |    7 +++
>  mm/swap.c                              |    2 +
>  18 files changed, 248 insertions(+), 66 deletions(-)
>  create mode 100644 Documentation/vm/idle_mem_tracking.txt
> 
> -- 
> 1.7.10.4
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] idle memory tracking
  2015-03-19  2:13   ` Minchan Kim
@ 2015-03-19  8:08     ` Vladimir Davydov
  -1 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-19  8:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Greg Thelen,
	Michel Lespinasse, David Rientjes, Pavel Emelyanov,
	Cyrill Gorcunov, Jonathan Corbet, linux-api, linux-doc, linux-mm,
	linux-kernel

On Thu, Mar 19, 2015 at 11:13:37AM +0900, Minchan Kim wrote:
> On Wed, Mar 18, 2015 at 11:44:33PM +0300, Vladimir Davydov wrote:
> >  1. Write 1 to /proc/sys/vm/set_idle.
> > 
> >     This will set the IDLE flag for all user pages. The IDLE flag is cleared
> >     when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
> >     to the page. It is also cleared when the page is freed.
> 
> So we have to scan all pages periodically? I understand why you did it, but
> some users might not care about unmapped pages, so I hope this can be optional.
> If someone just wants to track mapped file+anon pages, he can do it
> by scanning the address space of the process he selects.
> Someone might even want to scan just a part of the address space rather than
> the whole address space of the process. Actually, I have such a scenario.

You can still estimate the working set size of a particular process, or
even of a part of its address space, by setting the IDLE bit for all
user pages and then clearing refs for, and analyzing, only the pages you
are interested in. You can filter them by scanning /proc/PID/pagemap.
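
A minimal sketch of that filtering, assuming the documented
/proc/PID/pagemap layout (one 64-bit entry per virtual page, bit 63 set when
the page is present, bits 0-54 holding the PFN), the KPF_IDLE bit number 25
from this patch, and 4 KiB pages. The helper names are made up for
illustration, and reading PFNs this way is expected to need root:

```python
import struct

PAGE_SIZE = 4096               # assumption: 4 KiB base pages
PM_PRESENT = 1 << 63           # pagemap: page is present in RAM
PM_PFN_MASK = (1 << 55) - 1    # pagemap: bits 0-54 hold the PFN
KPF_IDLE = 25                  # bit number added by this patch


def pagemap_pfn(entry):
    """Return the PFN stored in a pagemap entry, or None if not present."""
    if not entry & PM_PRESENT:
        return None
    return entry & PM_PFN_MASK


def count_idle_in_range(pid, start, end):
    """Count idle pages mapped in [start, end) of process pid."""
    nidle = 0
    with open("/proc/%d/pagemap" % pid, "rb") as pm, \
         open("/proc/kpageflags", "rb") as kpf:
        pm.seek(start // PAGE_SIZE * 8)
        for _ in range((end - start) // PAGE_SIZE):
            data = pm.read(8)
            if len(data) < 8:
                break
            pfn = pagemap_pfn(struct.unpack("Q", data)[0])
            if pfn is None:
                continue
            kpf.seek(pfn * 8)
            data = kpf.read(8)
            if len(data) < 8:
                continue
            if (struct.unpack("Q", data)[0] >> KPF_IDLE) & 1:
                nidle += 1
    return nidle
```

Note that this counts per mapping: a page mapped at two addresses inside
the range would be counted twice.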

If you are concerned about performance, I don't think it would be an
issue: on my test machine, setting the IDLE bit for 20 GB of user pages
takes about 150 ms. Given that this kind of work is supposed to be
done relatively rarely (every few minutes or so), the overhead looks
negligible to me. Anyway, we can introduce /proc/PID/set_mem_idle for
setting the IDLE bit only on pages of a particular address space.
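
As a back-of-the-envelope check of those numbers, the snippet below works
out the per-page cost and the share of one CPU spent scanning. The 20 GB and
150 ms figures are the measurement quoted above; the 4 KiB page size and the
5-minute scan interval are assumptions picked for illustration. Under those
assumptions it comes to roughly 29 ns per page and about 0.05% of one CPU.

```python
PAGE_SIZE = 4096              # assumption: 4 KiB pages
mem_bytes = 20 * 2**30        # 20 GB of user pages (from the measurement)
scan_seconds = 0.150          # about 150 ms per pass (from the measurement)
interval_seconds = 5 * 60     # assumption: one pass every 5 minutes

npages = mem_bytes // PAGE_SIZE
ns_per_page = scan_seconds / npages * 1e9
duty_cycle = scan_seconds / interval_seconds

print("pages scanned:  %d" % npages)
print("cost per page:  %.1f ns" % ns_per_page)
print("CPU time share: %.3f%%" % (duty_cycle * 100))
```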

> 
> > 
> >  2. Wait some time.
> > 
> >  3. Write 6 to /proc/PID/clear_refs for each PID of interest.
> > 
> >     This will clear the IDLE flag for recently accessed pages.
> > 
> >  4. Count the number of idle pages as reported by /proc/kpageflags. One may use
> >     /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
> >     certain application/container.
> > 
> 
> Adding two new page flags? That may be okay for 64 bit, but there is no
> room on 32 bit. Please take care of 32 bit: it would be a good feature for
> embedded systems. How about using page_ext if you can't make room in
> page->flags on 32 bit? You could add the per-page metadata there.

For the time being, I made it dependent on 64BIT explicitly, because I
am only interested in analyzing the working set size of containers running
on big machines, but I admit one could use page_ext for storing the
additional flags when compiling for 32 bit.

> 
> Your suggestion is generic, so my concern is overhead. On every iteration
> we have to set, clear, and inspect page flags. I don't know how much overhead
> that adds, but it could surely be big if memory is big.
> Couldn't we do it all in one go? Maybe with something like mincore:
> 
>         int idlecore(pid_t pid, void *addr, size_t length, unsigned char *vec)
> 
> Then vec would tell us which pages of the process [pid] were idle in the
> range [addr, addr + length), and the same system call would reset the idle
> state of those pages for the process in one go.

I don't think adding yet another syscall for such a specialized feature
is a good idea. Besides, I want to keep the interface consistent with
/proc/PID/clear_refs, which IMO is perfectly suited to clearing the
IDLE flag on referenced pages. As I mentioned above, to reduce the
overhead in case the user is not interested in unmapped file pages, we
could introduce /proc/PID/set_mem_idle, though I think this should only
be done if there are complaints about /proc/sys/vm/set_idle performance.
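
For reference, the way the flags interact in this patch can be modelled in
a few lines. This is a toy model and not kernel code: it assumes a page
mapped by a single PTE, and the class and function names are made up. It
follows the clear_refs, page_referenced_one() and set_mem_idle hunks of
patch 3/3.

```python
class Page(object):
    """Toy model of one user page mapped by a single PTE."""
    def __init__(self):
        self.pte_young = False   # ACCESS/YOUNG bit in the PTE
        self.idle = False        # PG_idle
        self.young = False       # PG_young


def touch(page):
    """The application accesses the page through its mapping."""
    page.pte_young = True


def set_idle(page):
    """Effect of writing 1 to /proc/sys/vm/set_idle on an LRU page."""
    page.idle = True


def clear_refs_idle(page):
    """Effect of writing 6 to /proc/PID/clear_refs on a mapped page."""
    if page.pte_young:
        page.pte_young = False
        page.idle = False
        page.young = True    # keep the reference visible to the reclaimer


def page_referenced(page):
    """Reference count as seen by the reclaimer (mm/rmap.c hunk)."""
    referenced = 0
    if page.pte_young:
        page.pte_young = False
        page.idle = False
        referenced += 1
    if page.young:
        page.young = False
        referenced += 1
    return referenced
```

In this model a page that is touched between set_idle() and
clear_refs_idle() loses its idle flag, yet page_referenced() still reports
one reference afterwards, which is how the scan avoids disturbing reclaim.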

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] mm: idle memory tracking
  2015-03-18 20:44   ` Vladimir Davydov
@ 2015-03-19 10:12     ` Cyrill Gorcunov
  -1 siblings, 0 replies; 23+ messages in thread
From: Cyrill Gorcunov @ 2015-03-19 10:12 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Greg Thelen,
	Michel Lespinasse, David Rientjes, Pavel Emelyanov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

On Wed, Mar 18, 2015 at 11:44:36PM +0300, Vladimir Davydov wrote:
> Knowing the portion of memory that is not used by a certain application
> or memory cgroup (idle memory) can be useful for partitioning the system
> efficiently. Currently, the only means to estimate the amount of idle
> memory provided by the kernel is /proc/PID/clear_refs. However, it has
> two serious shortcomings:
> 
>  - it does not count unmapped file pages
>  - it affects the reclaimer logic
> 
> This patch attempts to provide the userspace with the means to track
> idle memory without the above mentioned limitations.
...
> +static void set_mem_idle(void)
> +{
> +	int nid;
> +
> +	for_each_online_node(nid)
> +		set_mem_idle_node(nid);
> +}

Vladimir, might we need get_online_mems/put_online_mems here,
or is it not a problem if a node goes offline? (Asking
because I don't know.)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] mm: idle memory tracking
  2015-03-19 10:12     ` Cyrill Gorcunov
  (?)
@ 2015-03-19 10:41       ` Vladimir Davydov
  -1 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-19 10:41 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Greg Thelen,
	Michel Lespinasse, David Rientjes, Pavel Emelyanov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

On Thu, Mar 19, 2015 at 01:12:05PM +0300, Cyrill Gorcunov wrote:
> On Wed, Mar 18, 2015 at 11:44:36PM +0300, Vladimir Davydov wrote:
> > +static void set_mem_idle(void)
> > +{
> > +	int nid;
> > +
> > +	for_each_online_node(nid)
> > +		set_mem_idle_node(nid);
> > +}
> 
> Vladimir, might we need get_online_mems/put_online_mems here,
> or won't it be a problem if a node goes offline? (Asking
> because I don't know.)

I only need to dereference page structs corresponding to the node here,
and page structs are not freed when the node goes offline AFAICS, so I
guess it must be safe.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] mm: idle memory tracking
  2015-03-19 10:41       ` Vladimir Davydov
@ 2015-03-19 10:45         ` Cyrill Gorcunov
  -1 siblings, 0 replies; 23+ messages in thread
From: Cyrill Gorcunov @ 2015-03-19 10:45 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Greg Thelen,
	Michel Lespinasse, David Rientjes, Pavel Emelyanov,
	Jonathan Corbet, linux-api, linux-doc, linux-mm, linux-kernel

On Thu, Mar 19, 2015 at 01:41:03PM +0300, Vladimir Davydov wrote:
> > 
> > Vladimir, might we need get_online_mems/put_online_mems here,
> > or won't it be a problem if a node goes offline? (Asking
> > because I don't know.)
> 
> I only need to dereference page structs corresponding to the node here,
> and page structs are not freed when the node goes offline AFAICS, so I
> guess it must be safe.

OK, thanks for info!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/3] idle memory tracking
  2015-03-18 20:44 ` Vladimir Davydov
  (?)
@ 2015-03-24  7:45   ` Vladimir Davydov
  -1 siblings, 0 replies; 23+ messages in thread
From: Vladimir Davydov @ 2015-03-24  7:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen,
	Michel Lespinasse, David Rientjes, Pavel Emelyanov,
	Cyrill Gorcunov, Jonathan Corbet, linux-api, linux-doc, linux-mm,
	linux-kernel

On Wed, Mar 18, 2015 at 11:44:33PM +0300, Vladimir Davydov wrote:
> Usage:
> 
>  1. Write 1 to /proc/sys/vm/set_idle.
> 
>     This will set the IDLE flag for all user pages. The IDLE flag is cleared
>     when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
>     to the page. It is also cleared when the page is freed.
> 
>  2. Wait some time.
> 
>  3. Write 6 to /proc/PID/clear_refs for each PID of interest.
> 
>     This will clear the IDLE flag for recently accessed pages.
> 
>  4. Count the number of idle pages as reported by /proc/kpageflags. One may use
>     /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
>     certain application/container.
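
Step 4 of the quoted usage could be sketched in userspace roughly as
follows. This is a minimal sketch, not code from the patch set.
/proc/kpageflags exposes one 64-bit flag word per pfn, but the bit
position of the IDLE flag (KPF_IDLE_BIT below) is an assumption and has
to be checked against the patched kernel's headers. count_flagged() and
count_idle_pages() are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * ASSUMPTION: bit index of the IDLE flag in a kpageflags word. Check the
 * patched kernel's kernel-page-flags.h for the real value.
 */
#define KPF_IDLE_BIT 25

/* Count the entries in flags[0..n) that have the given bit set. */
static size_t count_flagged(const uint64_t *flags, size_t n, unsigned int bit)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (flags[i] & (UINT64_C(1) << bit))
			count++;
	return count;
}

/*
 * Read a kpageflags-style file (one 64-bit word per pfn) and count the
 * pages that have the assumed IDLE bit set. Returns -1 if the file
 * cannot be opened.
 */
static long count_idle_pages(const char *path)
{
	uint64_t buf[512];
	size_t n;
	long idle = 0;
	FILE *f = fopen(path, "rb");

	if (!f)
		return -1;
	while ((n = fread(buf, sizeof(buf[0]), 512, f)) > 0)
		idle += (long)count_flagged(buf, n, KPF_IDLE_BIT);
	fclose(f);
	return idle;
}
```

Calling count_idle_pages("/proc/kpageflags") as root after steps 1-3
would give a system-wide count. Filtering by /proc/PID/pagemap or
/proc/kpagecgroup is left out of this sketch.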

Any more thoughts on this? I am particularly interested in the user
interface. I think that /proc/kpagecgroup is OK, but I have my
reservations about using /proc/sys/vm/set_idle and /proc/PID/clear_refs
for setting and clearing the idle flag. The point is that it is
impossible to scan memory for setting/clearing page idle flags in the
background at some predefined rate - one has to scan it all at once,
which might result in CPU load spikes on huge machines with TBs of RAM.
Maybe we'd better introduce /proc/sys/vm/{set_idle,clear_refs_idle},
which would receive a pfn range to set/clear idle flags?
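
A rate-limited scan over pfn ranges might be driven from userspace
roughly like the sketch below. Everything in it is hypothetical: no
pfn-range interface exists in the posted patches, and next_pfn_chunk()
is an illustrative helper, not a kernel or libc function. It only shows
how a daemon could split a pfn range into fixed-size chunks and handle
them one at a time.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: given a cursor into [*cur, end), store the next
 * chunk [*lo, *hi) of at most `chunk` pfns and advance the cursor.
 * Returns false once the range is exhausted or if chunk is 0.
 */
static bool next_pfn_chunk(uint64_t *cur, uint64_t end, uint64_t chunk,
			   uint64_t *lo, uint64_t *hi)
{
	if (chunk == 0 || *cur >= end)
		return false;
	*lo = *cur;
	/* Compare against the remaining length to avoid overflow. */
	*hi = (end - *cur > chunk) ? *cur + chunk : end;
	*cur = *hi;
	return true;
}
```

A daemon would pass each [lo, hi) range to the proposed file and then
sleep, so the scan rate would be roughly `chunk` pfns per sleep interval
instead of one burst over all of memory.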

Any thoughts/ideas are more than welcome.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 23+ messages in thread

Thread overview: 23+ messages
2015-03-18 20:44 [PATCH 0/3] idle memory tracking Vladimir Davydov
2015-03-18 20:44 ` [PATCH 1/3] memcg: add page_cgroup_ino helper Vladimir Davydov
2015-03-18 20:44 ` [PATCH 2/3] proc: add kpagecgroup file Vladimir Davydov
2015-03-18 20:44 ` [PATCH 3/3] mm: idle memory tracking Vladimir Davydov
2015-03-19 10:12   ` Cyrill Gorcunov
2015-03-19 10:41     ` Vladimir Davydov
2015-03-19 10:45       ` Cyrill Gorcunov
2015-03-19  2:13 ` [PATCH 0/3] " Minchan Kim
2015-03-19  8:08   ` Vladimir Davydov
2015-03-24  7:45 ` Vladimir Davydov
