* [PATCH 0/8] idle page tracking / working set estimation
@ 2011-09-17  3:39 ` Michel Lespinasse
  0 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  3:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Dave Hansen
  Cc: Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

Please comment on the following patches (which are against the v3.0 kernel).
We are using these to collect memory utilization statistics for each cgroup
across many machines, and to optimize job placement accordingly.

The statistics are intended to be compared across many machines - we
don't just want to know which cgroup to reclaim from on an individual
machine, we also need to know which machine within a large cluster is
the best target for a job. Also, we try to have a low impact on the
normal MM algorithms - we think they already do a fine job balancing
resources on individual machines, so we are not trying to interfere
with that here.

Patch 1 introduces no functionality; it modifies the page_referenced API
so that it can be more easily extended in patch 3.

Patch 2 documents the proposed features and adds a configuration option
for them. When the features are compiled in, they remain disabled until
the administrator sets the desired scanning interval; however, the
configuration option seems necessary as the features make use of 3 extra
page flags - there is plenty of space for these in 64-bit builds, but
less so in 32-bit builds...

Patch 3 introduces page_referenced_kstaled(), which is similar to
page_referenced() but is used for idle page tracking rather than
for memory reclamation. Since both functions clear the pte_young bits
and we don't want them to interfere with each other, two new page flags
are introduced that track when young pte references have been cleared by
each of the page_referenced variants. The page_referenced functions are also
extended to return the dirty status of any pte references encountered.

Patch 4 introduces the 'kstaled' thread that handles idle page tracking.
The thread starts disabled; one enables it by setting a scanning interval
in /sys/kernel/mm/kstaled/scan_seconds. It then scans all physical memory
pages, looking for idle pages - pages that have not been touched since the
previous scan interval. These pages are further classified into idle_clean
(which are immediately reclaimable), idle_dirty_swap (which are reclaimable
if swap is enabled on the system), and idle_dirty_file (which are reclaimable
after writeback occurs). These statistics are published for each cgroup in
a new /dev/cgroup/*/memory.idle_page_stats file. We did not use the
memory.stat file there because we thought these stats are different -
first, they are meaningless until one sets the scan_seconds value, and
then they are only updated once per scan interval, whereas the memory.stat
values are continually updated.
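
For example (illustrative only - 'mygroup' is a hypothetical cgroup name,
and the field names are the ones documented in patch 2):

# echo 120 > /sys/kernel/mm/kstaled/scan_seconds    # one full scan every 2 minutes
# cat /dev/cgroup/mygroup/memory.idle_page_stats    # updated once per scan; reports
                                                    # idle_clean, idle_dirty_file,
                                                    # idle_dirty_swap and friends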

Patch 5 is a small optimization skipping over memory holes.

Patch 6 rate limits the idle page scanning so that it occurs in small
chunks over the length of the scan interval, rather than all at once.

Patch 7 adds extra functionality to track how long a given page has been
idle, so that memory.idle_page_stats can report pages that have been
idle for 1, 2, 5, 15, 30, 60, 120 or 240 consecutive scan intervals.
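
To make the bucketing concrete, here is a sketch (not code from the
patches; the helper and array names are made up). As documented in
patch 2, the accounting is cumulative, so a page idle for N scan cycles
is counted in every bucket whose threshold is <= N:

	/* illustrative sketch only */
	static const unsigned int idle_thresholds[] =
		{ 1, 2, 5, 15, 30, 60, 120, 240 };

	static void account_idle_histogram(unsigned long buckets[8],
					   unsigned int idle_cycles)
	{
		int i;

		for (i = 0; i < 8 && idle_thresholds[i] <= idle_cycles; i++)
			buckets[i]++;	/* cumulative accounting */
	}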

Patch 8 adds extra functionality in the form of an incremental update
feature. Here we only report immediately reclaimable idle pages; however
we don't want to wait for the end of a scan interval to update this number
if the system experiences a rapid increase in memory pressure.
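
A usage sketch for this feature (the memory.stale_page_age file and the
"stale" field are described in patch 2; 'mygroup' is again hypothetical):

# echo 5 > /dev/cgroup/mygroup/memory.stale_page_age    # pages idle >= 5 scan cycles
# grep stale /dev/cgroup/mygroup/memory.idle_page_stats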

Michel Lespinasse (8):
  page_referenced: replace vm_flags parameter with struct pr_info
  kstaled: documentation and config option.
  kstaled: page_referenced_kstaled() and supporting infrastructure.
  kstaled: minimalistic implementation.
  kstaled: skip non-RAM regions.
  kstaled: rate limit pages scanned per second.
  kstaled: add histogram sampling functionality
  kstaled: add incrementally updating stale page count

 Documentation/cgroups/memory.txt  |  103 ++++++++-
 arch/x86/include/asm/page_types.h |    8 +
 arch/x86/kernel/e820.c            |   45 ++++
 include/linux/ksm.h               |    9 +-
 include/linux/mmzone.h            |   11 +
 include/linux/page-flags.h        |   50 ++++
 include/linux/pagemap.h           |   11 +-
 include/linux/rmap.h              |   82 ++++++-
 mm/Kconfig                        |   10 +
 mm/internal.h                     |    1 +
 mm/ksm.c                          |   15 +-
 mm/memcontrol.c                   |  492 +++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c               |    6 +
 mm/mlock.c                        |    1 +
 mm/rmap.c                         |  136 ++++++-----
 mm/swap.c                         |    1 +
 mm/vmscan.c                       |   20 +-
 17 files changed, 904 insertions(+), 97 deletions(-)

-- 
1.7.3.1



* [PATCH 1/8] page_referenced: replace vm_flags parameter with struct pr_info
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-17  3:39   ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  3:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Dave Hansen
  Cc: Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

Introduce struct pr_info, passed into the page_referenced() family of
functions, to represent information about the pte references that have
been found for that page. It currently contains the vm_flags information
as well as a PR_REFERENCED flag. The idea is to make it easy to extend
the API with new flags.


Signed-off-by: Michel Lespinasse <walken@google.com>
---
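
For illustration only (not part of the patch): a minimal sketch of how a
caller converts from the old return-value interface to the new struct
pr_info one, mirroring the mm/vmscan.c hunks below; the was_referenced()
helper is hypothetical.

	static bool was_referenced(struct page *page, struct mem_cgroup *memcg)
	{
		struct pr_info info;

		page_referenced(page, 1, memcg, &info);	/* fills vm_flags/pr_flags */
		if (info.vm_flags & VM_LOCKED)
			return false;	/* mlocked; let try_to_unmap() deal with it */
		return info.pr_flags & PR_REFERENCED;
	}
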
 include/linux/ksm.h  |    9 ++---
 include/linux/rmap.h |   28 ++++++++++-----
 mm/ksm.c             |   15 +++-----
 mm/rmap.c            |   92 +++++++++++++++++++++++---------------------------
 mm/vmscan.c          |   18 +++++----
 5 files changed, 81 insertions(+), 81 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 3319a69..432c49b 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -83,8 +83,8 @@ static inline int ksm_might_need_to_copy(struct page *page,
 		 page->index != linear_page_index(vma, address));
 }
 
-int page_referenced_ksm(struct page *page,
-			struct mem_cgroup *memcg, unsigned long *vm_flags);
+void page_referenced_ksm(struct page *page,
+			struct mem_cgroup *memcg, struct pr_info *info);
 int try_to_unmap_ksm(struct page *page, enum ttu_flags flags);
 int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
 		  struct vm_area_struct *, unsigned long, void *), void *arg);
@@ -119,10 +119,9 @@ static inline int ksm_might_need_to_copy(struct page *page,
 	return 0;
 }
 
-static inline int page_referenced_ksm(struct page *page,
-			struct mem_cgroup *memcg, unsigned long *vm_flags)
+static inline void page_referenced_ksm(struct page *page,
+			struct mem_cgroup *memcg, struct pr_info *info)
 {
-	return 0;
 }
 
 static inline int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..7c99c6f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -67,6 +67,15 @@ struct anon_vma_chain {
 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
 };
 
+/*
+ * Information to be filled by page_referenced() and friends.
+ */
+struct pr_info {
+	unsigned long vm_flags;
+	unsigned int pr_flags;
+#define PR_REFERENCED  1
+};
+
 #ifdef CONFIG_MMU
 static inline void get_anon_vma(struct anon_vma *anon_vma)
 {
@@ -156,10 +165,11 @@ static inline void page_dup_rmap(struct page *page)
 /*
  * Called from mm/vmscan.c to handle paging out
  */
-int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *cnt, unsigned long *vm_flags);
-int page_referenced_one(struct page *, struct vm_area_struct *,
-	unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
+void page_referenced(struct page *, int is_locked,
+		     struct mem_cgroup *cnt, struct pr_info *info);
+void page_referenced_one(struct page *, struct vm_area_struct *,
+			 unsigned long address, unsigned int *mapcount,
+			 struct pr_info *info);
 
 enum ttu_flags {
 	TTU_UNMAP = 0,			/* unmap mode */
@@ -234,12 +244,12 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
 #define anon_vma_prepare(vma)	(0)
 #define anon_vma_link(vma)	do {} while (0)
 
-static inline int page_referenced(struct page *page, int is_locked,
-				  struct mem_cgroup *cnt,
-				  unsigned long *vm_flags)
+static inline void page_referenced(struct page *page, int is_locked,
+				   struct mem_cgroup *cnt,
+				   struct pr_info *info)
 {
-	*vm_flags = 0;
-	return 0;
+	info->vm_flags = 0;
+	info->pr_flags = 0;
 }
 
 #define try_to_unmap(page, refs) SWAP_FAIL
diff --git a/mm/ksm.c b/mm/ksm.c
index 9a68b0c..5f540a4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1587,14 +1587,13 @@ struct page *ksm_does_need_to_copy(struct page *page,
 	return new_page;
 }
 
-int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
-			unsigned long *vm_flags)
+void page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
+			struct pr_info *info)
 {
 	struct stable_node *stable_node;
 	struct rmap_item *rmap_item;
 	struct hlist_node *hlist;
 	unsigned int mapcount = page_mapcount(page);
-	int referenced = 0;
 	int search_new_forks = 0;
 
 	VM_BUG_ON(!PageKsm(page));
@@ -1602,7 +1601,7 @@ int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
 
 	stable_node = page_stable_node(page);
 	if (!stable_node)
-		return 0;
+		return;
 again:
 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
@@ -1627,19 +1626,17 @@ again:
 			if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
 				continue;
 
-			referenced += page_referenced_one(page, vma,
-				rmap_item->address, &mapcount, vm_flags);
+			page_referenced_one(page, vma, rmap_item->address,
+					    &mapcount, info);
 			if (!search_new_forks || !mapcount)
 				break;
 		}
 		anon_vma_unlock(anon_vma);
 		if (!mapcount)
-			goto out;
+			return;
 	}
 	if (!search_new_forks++)
 		goto again;
-out:
-	return referenced;
 }
 
 int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
diff --git a/mm/rmap.c b/mm/rmap.c
index 23295f6..6ff8ecf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -648,12 +648,12 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
  * Subfunctions of page_referenced: page_referenced_one called
  * repeatedly from either page_referenced_anon or page_referenced_file.
  */
-int page_referenced_one(struct page *page, struct vm_area_struct *vma,
-			unsigned long address, unsigned int *mapcount,
-			unsigned long *vm_flags)
+void page_referenced_one(struct page *page, struct vm_area_struct *vma,
+			 unsigned long address, unsigned int *mapcount,
+			 struct pr_info *info)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int referenced = 0;
+	bool referenced = false;
 
 	if (unlikely(PageTransHuge(page))) {
 		pmd_t *pmd;
@@ -667,19 +667,19 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 					     PAGE_CHECK_ADDRESS_PMD_FLAG);
 		if (!pmd) {
 			spin_unlock(&mm->page_table_lock);
-			goto out;
+			return;
 		}
 
 		if (vma->vm_flags & VM_LOCKED) {
 			spin_unlock(&mm->page_table_lock);
 			*mapcount = 0;	/* break early from loop */
-			*vm_flags |= VM_LOCKED;
-			goto out;
+			info->vm_flags |= VM_LOCKED;
+			return;
 		}
 
 		/* go ahead even if the pmd is pmd_trans_splitting() */
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
-			referenced++;
+			referenced = true;
 		spin_unlock(&mm->page_table_lock);
 	} else {
 		pte_t *pte;
@@ -691,13 +691,13 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		 */
 		pte = page_check_address(page, mm, address, &ptl, 0);
 		if (!pte)
-			goto out;
+			return;
 
 		if (vma->vm_flags & VM_LOCKED) {
 			pte_unmap_unlock(pte, ptl);
 			*mapcount = 0;	/* break early from loop */
-			*vm_flags |= VM_LOCKED;
-			goto out;
+			info->vm_flags |= VM_LOCKED;
+			return;
 		}
 
 		if (ptep_clear_flush_young_notify(vma, address, pte)) {
@@ -709,7 +709,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			 * set PG_referenced or activated the page.
 			 */
 			if (likely(!VM_SequentialReadHint(vma)))
-				referenced++;
+				referenced = true;
 		}
 		pte_unmap_unlock(pte, ptl);
 	}
@@ -718,28 +718,27 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	   swap token and is in the middle of a page fault. */
 	if (mm != current->mm && has_swap_token(mm) &&
 			rwsem_is_locked(&mm->mmap_sem))
-		referenced++;
+		referenced = true;
 
 	(*mapcount)--;
 
-	if (referenced)
-		*vm_flags |= vma->vm_flags;
-out:
-	return referenced;
+	if (referenced) {
+		info->vm_flags |= vma->vm_flags;
+		info->pr_flags |= PR_REFERENCED;
+	}
 }
 
-static int page_referenced_anon(struct page *page,
-				struct mem_cgroup *mem_cont,
-				unsigned long *vm_flags)
+static void page_referenced_anon(struct page *page,
+				 struct mem_cgroup *mem_cont,
+				 struct pr_info *info)
 {
 	unsigned int mapcount;
 	struct anon_vma *anon_vma;
 	struct anon_vma_chain *avc;
-	int referenced = 0;
 
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
-		return referenced;
+		return;
 
 	mapcount = page_mapcount(page);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
@@ -754,21 +753,20 @@ static int page_referenced_anon(struct page *page,
 		 */
 		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
 			continue;
-		referenced += page_referenced_one(page, vma, address,
-						  &mapcount, vm_flags);
+		page_referenced_one(page, vma, address, &mapcount, info);
 		if (!mapcount)
 			break;
 	}
 
 	page_unlock_anon_vma(anon_vma);
-	return referenced;
 }
 
 /**
  * page_referenced_file - referenced check for object-based rmap
  * @page: the page we're checking references on.
  * @mem_cont: target memory controller
- * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @info: collect encountered vma->vm_flags who actually referenced the page
+ *        as well as flags describing the page references encountered.
  *
  * For an object-based mapped page, find all the places it is mapped and
  * check/clear the referenced flag.  This is done by following the page->mapping
@@ -777,16 +775,15 @@ static int page_referenced_anon(struct page *page,
  *
  * This function is only called from page_referenced for object-based pages.
  */
-static int page_referenced_file(struct page *page,
-				struct mem_cgroup *mem_cont,
-				unsigned long *vm_flags)
+static void page_referenced_file(struct page *page,
+				 struct mem_cgroup *mem_cont,
+				 struct pr_info *info)
 {
 	unsigned int mapcount;
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
-	int referenced = 0;
 
 	/*
 	 * The caller's checks on page->mapping and !PageAnon have made
@@ -822,14 +819,12 @@ static int page_referenced_file(struct page *page,
 		 */
 		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
 			continue;
-		referenced += page_referenced_one(page, vma, address,
-						  &mapcount, vm_flags);
+		page_referenced_one(page, vma, address, &mapcount, info);
 		if (!mapcount)
 			break;
 	}
 
 	mutex_unlock(&mapping->i_mmap_mutex);
-	return referenced;
 }
 
 /**
@@ -837,45 +832,42 @@ static int page_referenced_file(struct page *page,
  * @page: the page to test
  * @is_locked: caller holds lock on the page
  * @mem_cont: target memory controller
- * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @info: collect encountered vma->vm_flags who actually referenced the page
+ *        as well as flags describing the page references encountered.
  *
  * Quick test_and_clear_referenced for all mappings to a page,
  * returns the number of ptes which referenced the page.
  */
-int page_referenced(struct page *page,
-		    int is_locked,
-		    struct mem_cgroup *mem_cont,
-		    unsigned long *vm_flags)
+void page_referenced(struct page *page,
+		     int is_locked,
+		     struct mem_cgroup *mem_cont,
+		     struct pr_info *info)
 {
-	int referenced = 0;
 	int we_locked = 0;
 
-	*vm_flags = 0;
+	info->vm_flags = 0;
+	info->pr_flags = 0;
+
 	if (page_mapped(page) && page_rmapping(page)) {
 		if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
 			we_locked = trylock_page(page);
 			if (!we_locked) {
-				referenced++;
+				info->pr_flags |= PR_REFERENCED;
 				goto out;
 			}
 		}
 		if (unlikely(PageKsm(page)))
-			referenced += page_referenced_ksm(page, mem_cont,
-								vm_flags);
+			page_referenced_ksm(page, mem_cont, info);
 		else if (PageAnon(page))
-			referenced += page_referenced_anon(page, mem_cont,
-								vm_flags);
+			page_referenced_anon(page, mem_cont, info);
 		else if (page->mapping)
-			referenced += page_referenced_file(page, mem_cont,
-								vm_flags);
+			page_referenced_file(page, mem_cont, info);
 		if (we_locked)
 			unlock_page(page);
 	}
 out:
 	if (page_test_and_clear_young(page_to_pfn(page)))
-		referenced++;
-
-	return referenced;
+		info->pr_flags |= PR_REFERENCED;
 }
 
 static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d036e59..7bd9868 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -647,10 +647,10 @@ enum page_references {
 static enum page_references page_check_references(struct page *page,
 						  struct scan_control *sc)
 {
-	int referenced_ptes, referenced_page;
-	unsigned long vm_flags;
+	int referenced_page;
+	struct pr_info info;
 
-	referenced_ptes = page_referenced(page, 1, sc->mem_cgroup, &vm_flags);
+	page_referenced(page, 1, sc->mem_cgroup, &info);
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
@@ -661,10 +661,10 @@ static enum page_references page_check_references(struct page *page,
 	 * Mlock lost the isolation race with us.  Let try_to_unmap()
 	 * move the page to the unevictable list.
 	 */
-	if (vm_flags & VM_LOCKED)
+	if (info.vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
-	if (referenced_ptes) {
+	if (info.pr_flags & PR_REFERENCED) {
 		if (PageAnon(page))
 			return PAGEREF_ACTIVATE;
 		/*
@@ -1535,7 +1535,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 {
 	unsigned long nr_taken;
 	unsigned long pgscanned;
-	unsigned long vm_flags;
+	struct pr_info info;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
@@ -1582,7 +1582,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			continue;
 		}
 
-		if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
+		page_referenced(page, 0, sc->mem_cgroup, &info);
+		if (info.pr_flags & PR_REFERENCED) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
@@ -1593,7 +1594,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			 * IO, plus JVM can create lots of anon VM_EXEC pages,
 			 * so we ignore them here.
 			 */
-			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
+			if ((info.vm_flags & VM_EXEC) &&
+			    page_is_file_cache(page)) {
 				list_add(&page->lru, &l_active);
 				continue;
 			}
-- 
1.7.3.1



* [PATCH 2/8] kstaled: documentation and config option.
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-17  3:39   ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  3:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Dave Hansen
  Cc: Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

Extend the memory cgroup documentation to describe the optional idle page
tracking features, and add the corresponding configuration option.


Signed-off-by: Michel Lespinasse <walken@google.com>
---
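
For reference, a build-time sketch (not part of the patch) enabling the
new option together with the dependency declared in the Kconfig hunk
below:

	# .config fragment
	CONFIG_CGROUP_MEM_RES_CTLR=y
	CONFIG_KSTALED=y
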
 Documentation/cgroups/memory.txt |  103 +++++++++++++++++++++++++++++++++++++-
 mm/Kconfig                       |   10 ++++
 2 files changed, 112 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 06eb6d9..7ee2eb3 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -672,7 +672,108 @@ At reading, current status of OOM is shown.
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
 
-11. TODO
+
+11. Idle page tracking
+
+Idle page tracking works by scanning physical memory at a known rate,
+finding idle pages, and accounting for them in the cgroup owning them.
+
+Idle pages are defined as user pages (either anon or file backed) that have
+not been accessed for a number of consecutive scans, and are also not
+currently pinned down (for example by being mlocked).
+
+11.1 Usage
+
+The first step is to select the global scanning rate:
+
+# echo 120 > /sys/kernel/mm/kstaled/scan_seconds	# 2 minutes per scan
+
+(At boot time, the default value for /sys/kernel/mm/kstaled/scan_seconds
+is 0 which means the idle page tracking feature is disabled).
+
+Then, the per-cgroup memory.idle_page_stats files get updated at the
+end of every scan. The relevant fields are:
+* idle_clean: idle pages that have been untouched for at least one scan cycle,
+  and are also clean. Being clean and unpinned, such pages are immediately
+  reclaimable by the MM's LRU algorithms.
+* idle_dirty_file: idle pages that have been untouched for at least one
+  scan cycle, are dirty, and are file backed. Such pages are not immediately
+  reclaimable as writeback needs to occur first.
+* idle_dirty_swap: idle pages that have been untouched for at least one
+  scan cycle, are dirty, and would have to be written to swap before being
+  reclaimed. This includes dirty anon memory, tmpfs files and shm segments.
+  Note that such pages are counted as idle_dirty_swap regardless of whether
+  swap is enabled or not on the system.
+* idle_2_clean, idle_2_dirty_file, idle_2_dirty_swap: same definitions as
+  above, but for pages that have been untouched for at least two scan cycles.
+* these fields repeat up to idle_240_clean, idle_240_dirty_file and
+  idle_240_dirty_swap, allowing one to observe idle pages over a variety
+  of idle interval lengths. Note that the accounting is cumulative:
+  pages counted as idle for a given interval length are also counted
+  as idle for smaller interval lengths.
+* scans: number of physical memory scans since the cgroup was created.
+
+All the above fields are updated exactly once per scan.
+
+11.2 Responsiveness guarantees
+
+After a user page stops being touched and/or pinned, it takes at least one
+scan cycle for that page to be considered as idle and accounted as such
+in one of the idle_clean / idle_dirty_file / idle_dirty_swap counts
+(or, n scan cycles for the page to be accounted as idle in one of the
+idle_N_clean / idle_N_dirty_file / idle_N_dirty_swap counts).
+
+However, there is no guarantee that pages will be detected that fast.
+In the worst case, it could take up to two extra scan cycle intervals
+for a page to be accounted as idle. This is because after userspace stops
+touching the page, it may take up to one scan interval before we next
+scan it (at which point the page will be seen as not idle yet since it
+was touched during the previous scan) and after the page is finally scanned
+again and detected as idle, it may take up to one extra scan interval before
+completing the physical memory scan and exporting the updated statistics.
+
+Conversely, when userspace touches or pins a page that was previously
+accounted for as idle, it may take up to two scan intervals before the
+corresponding statistics are updated. Once again, this is because it may
+take up to one scan interval before scanning the page and finding it not
+idle anymore, and up to one extra scan interval before completing the
+physical memory scan and exporting the updated statistics.
+
+11.3 Incremental idle page tracking
+
+In some situations, it is desired to obtain faster feedback when
+previously idle, clean user pages start being touched. Remember that
+unpinned clean pages are immediately reclaimable by the MM's LRU
+algorithms. A high number of such pages being idle in a given cgroup
+indicates that this cgroup is not experiencing high memory pressure.
+A decrease of that number can be seen as a leading indicator that
+memory pressure is about to increase, and it may be desired to act
+upon that indication before the two scan interval measurement delay.
+
+The incremental idle page tracking feature can be used for that case.
+It allows for tracking of idle clean pages only, and only for a
+predetermined number of scan intervals (no histogram functionality as
+in the main interface).
+
+The desired idle period must first be selected on a per-cgroup basis
+by writing an integer to the memory.stale_page_age file. The integer
+is the interval we want pages to be idle for, expressed in scan cycles.
+For example to check for pages that have been idle for 5 consecutive
+scan cycles (equivalent to the idle_5_clean statistic), one would
+write 5 to the memory.stale_page_age file. The default value for the
+memory.stale_page_age file is 0, which disables the incremental idle
+page tracking feature.
+
+During scanning, clean unpinned pages that have not been touched for the
+chosen number of scan cycles are incrementally accounted for and reflected
+in the "stale" statistic in memory.idle_page_stats. Likewise, pages that
+were previously accounted as stale and are found not to be idle anymore
+are also incrementally accounted for. Additionally, any pages that are
+being considered by the LRU replacement algorithm and found to have been
+touched are also incrementally accounted for.
+
+
+12. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/mm/Kconfig b/mm/Kconfig
index 8ca47a5..3406a39 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -370,3 +370,13 @@ config CLEANCACHE
 	  in a negligible performance hit.
 
 	  If unsure, say Y to enable cleancache
+
+config KSTALED
+       depends on CGROUP_MEM_RES_CTLR
+       bool "Per-cgroup idle page tracking"
+       help
+         This feature allows the kernel to report the amount of user pages
+	 in a cgroup that have not been touched in a given time.
+	 This information may be used to size the cgroups and/or for
+	 job placement within a compute cluster.
+	 See Documentation/cgroups/memory.txt for a more complete description.
-- 
1.7.3.1



* [PATCH 3/8] kstaled: page_referenced_kstaled() and supporting infrastructure.
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-17  3:39   ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  3:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Dave Hansen
  Cc: Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

Add a new page_referenced_kstaled() interface. The desired behavior
is that page_referenced() reports page references since the last
page_referenced() call, and page_referenced_kstaled() reports page
references since the last page_referenced_kstaled() call; the two are
independent and do not influence each other.

The following events are counted as kstaled page references:
- CPU data access to the page (as noticed through pte_young());
- mark_page_accessed() calls;
- page being freed / reallocated.


Signed-off-by: Michel Lespinasse <walken@google.com>
---
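
For illustration only (not part of the patch): a sketch of how a kstaled
scan might classify a page using the flags added here. It assumes
page_referenced_kstaled() takes (page, is_locked, info) arguments and
fills a struct pr_info much like __page_referenced() does; the
check_page_idle() and account_idle() helpers are hypothetical.

	static void check_page_idle(struct page *page)
	{
		struct pr_info info;

		page_referenced_kstaled(page, false, &info);
		if (info.pr_flags & PR_REFERENCED)
			return;		/* referenced since the last kstaled scan */
		/* idle; PR_DIRTY tells us whether reclaim would need writeback/swap */
		account_idle(page, info.pr_flags & PR_DIRTY);
	}
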
 include/linux/page-flags.h |   35 ++++++++++++++++++++++
 include/linux/rmap.h       |   68 +++++++++++++++++++++++++++++++++++++++----
 mm/rmap.c                  |   60 +++++++++++++++++++++++++++-----------
 mm/swap.c                  |    1 +
 4 files changed, 139 insertions(+), 25 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6081493..e964d98 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -51,6 +51,13 @@
  * PG_hwpoison indicates that a page got corrupted in hardware and contains
  * data with incorrect ECC bits that triggered a machine check. Accessing is
  * not safe since it may cause another machine check. Don't touch!
+ *
+ * PG_young indicates that kstaled cleared the young bit on some PTEs pointing
+ * to that page. In order to avoid interacting with the LRU algorithm, we want
+ * the next page_referenced() call to still consider the page young.
+ *
+ * PG_idle indicates that the page has not been referenced since the last time
+ * kstaled scanned it.
  */
 
 /*
@@ -107,6 +114,10 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_KSTALED
+	PG_young,		/* kstaled cleared pte_young */
+	PG_idle,		/* idle since start of kstaled interval */
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -278,6 +289,30 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+#ifdef CONFIG_KSTALED
+
+PAGEFLAG(Young, young)
+PAGEFLAG(Idle, idle)
+
+static inline void set_page_young(struct page *page)
+{
+	if (!PageYoung(page))
+		SetPageYoung(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	if (PageIdle(page))
+		ClearPageIdle(page);
+}
+
+#else /* !CONFIG_KSTALED */
+
+static inline void set_page_young(struct page *page) {}
+static inline void clear_page_idle(struct page *page) {}
+
+#endif /* CONFIG_KSTALED */
+
 u64 stable_page_flags(struct page *page);
 
 static inline int PageUptodate(struct page *page)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7c99c6f..27ca023 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -74,6 +74,8 @@ struct pr_info {
 	unsigned long vm_flags;
 	unsigned int pr_flags;
 #define PR_REFERENCED  1
+#define PR_DIRTY       2
+#define PR_FOR_KSTALED 4
 };
 
 #ifdef CONFIG_MMU
@@ -165,8 +167,8 @@ static inline void page_dup_rmap(struct page *page)
 /*
  * Called from mm/vmscan.c to handle paging out
  */
-void page_referenced(struct page *, int is_locked,
-		     struct mem_cgroup *cnt, struct pr_info *info);
+void __page_referenced(struct page *, int is_locked,
+		       struct mem_cgroup *cnt, struct pr_info *info);
 void page_referenced_one(struct page *, struct vm_area_struct *,
 			 unsigned long address, unsigned int *mapcount,
 			 struct pr_info *info);
@@ -244,12 +246,10 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
 #define anon_vma_prepare(vma)	(0)
 #define anon_vma_link(vma)	do {} while (0)
 
-static inline void page_referenced(struct page *page, int is_locked,
-				   struct mem_cgroup *cnt,
-				   struct pr_info *info)
+static inline void __page_referenced(struct page *page, int is_locked,
+				     struct mem_cgroup *cnt,
+				     struct pr_info *info)
 {
-	info->vm_flags = 0;
-	info->pr_flags = 0;
 }
 
 #define try_to_unmap(page, refs) SWAP_FAIL
@@ -262,6 +262,60 @@ static inline int page_mkclean(struct page *page)
 
 #endif	/* CONFIG_MMU */
 
+/**
+ * page_referenced - test if the page was referenced
+ * @page: the page to test
+ * @is_locked: caller holds lock on the page
+ * @mem_cont: target memory controller
+ * @info: collects the vm_flags of referencing vmas and the PR_* result flags
+ *
+ * Quick test_and_clear_referenced for all mappings to a page;
+ * the result is reported through @info rather than as a return value.
+ */
+static inline void page_referenced(struct page *page,
+				   int is_locked,
+				   struct mem_cgroup *mem_cont,
+				   struct pr_info *info)
+{
+	info->vm_flags = 0;
+	info->pr_flags = 0;
+
+#ifdef CONFIG_KSTALED
+	/*
+	 * Always clear PageYoung at the start of a scanning interval. It will
+	 * get set if kstaled clears a young bit in a pte reference,
+	 * so that vmscan will still see the page as referenced.
+	 */
+	if (PageYoung(page)) {
+		ClearPageYoung(page);
+		info->pr_flags |= PR_REFERENCED;
+	}
+#endif
+
+	__page_referenced(page, is_locked, mem_cont, info);
+}
+
+#ifdef CONFIG_KSTALED
+static inline void page_referenced_kstaled(struct page *page, bool is_locked,
+					   struct pr_info *info)
+{
+	info->vm_flags = 0;
+	info->pr_flags = PR_FOR_KSTALED;
+
+	/*
+	 * Always set PageIdle at the start of a scanning interval. It will
+	 * get cleared if a young page reference is encountered; otherwise
+	 * the page will be counted as idle at the next kstaled scan cycle.
+	 */
+	if (!PageIdle(page)) {
+		SetPageIdle(page);
+		info->pr_flags |= PR_REFERENCED;
+	}
+
+	__page_referenced(page, is_locked, NULL, info);
+}
+#endif
+
 /*
  * Return values of try_to_unmap
  */
diff --git a/mm/rmap.c b/mm/rmap.c
index 6ff8ecf..91f6d9c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -678,8 +678,17 @@ void page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		/* go ahead even if the pmd is pmd_trans_splitting() */
-		if (pmdp_clear_flush_young_notify(vma, address, pmd))
-			referenced = true;
+		if (!(info->pr_flags & PR_FOR_KSTALED)) {
+			if (pmdp_clear_flush_young_notify(vma, address, pmd)) {
+				referenced = true;
+				clear_page_idle(page);
+			}
+		} else {
+			if (pmdp_test_and_clear_young(vma, address, pmd)) {
+				referenced = true;
+				set_page_young(page);
+			}
+		}
 		spin_unlock(&mm->page_table_lock);
 	} else {
 		pte_t *pte;
@@ -693,6 +702,9 @@ void page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		if (!pte)
 			return;
 
+		if (pte_dirty(*pte))
+			info->pr_flags |= PR_DIRTY;
+
 		if (vma->vm_flags & VM_LOCKED) {
 			pte_unmap_unlock(pte, ptl);
 			*mapcount = 0;	/* break early from loop */
@@ -700,23 +712,38 @@ void page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			return;
 		}
 
-		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+		if (!(info->pr_flags & PR_FOR_KSTALED)) {
+			if (ptep_clear_flush_young_notify(vma, address, pte)) {
+				/*
+				 * Don't treat a reference through a
+				 * sequentially read mapping as such.
+				 * If the page has been used in another
+				 * mapping, we will catch it; if this other
+				 * mapping is already gone, the unmap path
+				 * will have set PG_referenced or activated
+				 * the page.
+				 */
+				if (likely(!VM_SequentialReadHint(vma)))
+					referenced = true;
+				clear_page_idle(page);
+			}
+		} else {
 			/*
-			 * Don't treat a reference through a sequentially read
-			 * mapping as such.  If the page has been used in
-			 * another mapping, we will catch it; if this other
-			 * mapping is already gone, the unmap path will have
-			 * set PG_referenced or activated the page.
+			 * Within page_referenced_kstaled():
+			 * skip TLB shootdown & VM_SequentialReadHint heuristic
 			 */
-			if (likely(!VM_SequentialReadHint(vma)))
+			if (ptep_test_and_clear_young(vma, address, pte)) {
 				referenced = true;
+				set_page_young(page);
+			}
 		}
 		pte_unmap_unlock(pte, ptl);
 	}
 
 	/* Pretend the page is referenced if the task has the
 	   swap token and is in the middle of a page fault. */
-	if (mm != current->mm && has_swap_token(mm) &&
+	if (!(info->pr_flags & PR_FOR_KSTALED) &&
+			mm != current->mm && has_swap_token(mm) &&
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced = true;
 
@@ -828,7 +855,7 @@ static void page_referenced_file(struct page *page,
 }
 
 /**
- * page_referenced - test if the page was referenced
+ * __page_referenced - test if the page was referenced
  * @page: the page to test
  * @is_locked: caller holds lock on the page
  * @mem_cont: target memory controller
@@ -838,16 +865,13 @@ static void page_referenced_file(struct page *page,
  * Quick test_and_clear_referenced for all mappings to a page,
  * returns the number of ptes which referenced the page.
  */
-void page_referenced(struct page *page,
-		     int is_locked,
-		     struct mem_cgroup *mem_cont,
-		     struct pr_info *info)
+void __page_referenced(struct page *page,
+		       int is_locked,
+		       struct mem_cgroup *mem_cont,
+		       struct pr_info *info)
 {
 	int we_locked = 0;
 
-	info->vm_flags = 0;
-	info->pr_flags = 0;
-
 	if (page_mapped(page) && page_rmapping(page)) {
 		if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
 			we_locked = trylock_page(page);
diff --git a/mm/swap.c b/mm/swap.c
index 3a442f1..d65b69e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -344,6 +344,7 @@ void mark_page_accessed(struct page *page)
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
+	clear_page_idle(page);
 }
 
 EXPORT_SYMBOL(mark_page_accessed);
-- 
1.7.3.1



* [PATCH 4/8] kstaled: minimalistic implementation.
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-17  3:39   ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  3:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Dave Hansen
  Cc: Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

Introduce a minimal kstaled implementation. The scan rate is controlled by
/sys/kernel/mm/kstaled/scan_seconds and per-cgroup statistics are reported
in /dev/cgroup/*/memory.idle_page_stats.
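
For illustration only (not part of the patch), a user-space sketch of the
intended workflow; the cgroup name is hypothetical and the statistic names
match the cb->fill() calls in the diff below. All values are in bytes except
"scans", which counts completed scan intervals.

    /* Sketch: enable kstaled scanning and dump one cgroup's idle page stats. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/kernel/mm/kstaled/scan_seconds", "w");
        char line[128];

        if (!f)
            return 1;
        fprintf(f, "120\n");            /* scan all of memory every 120 seconds */
        fclose(f);

        /* Statistics only update once per completed scan interval. */
        f = fopen("/dev/cgroup/mygroup/memory.idle_page_stats", "r");
        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);        /* idle_clean, idle_dirty_file, ... */
        fclose(f);
        return 0;
    }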


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/memcontrol.c |  291 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 291 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..aebd45a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,8 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/kthread.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -283,6 +285,16 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+#ifdef CONFIG_KSTALED
+	seqcount_t idle_page_stats_lock;
+	struct idle_page_stats {
+		unsigned long idle_clean;
+		unsigned long idle_dirty_file;
+		unsigned long idle_dirty_swap;
+	} idle_page_stats, idle_scan_stats;
+	unsigned long idle_page_scans;
+#endif
 };
 
 /* Stuffs for move charges at task migration. */
@@ -4668,6 +4680,30 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
 }
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_KSTALED
+static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	unsigned int seqcount;
+	struct idle_page_stats stats;
+	unsigned long scans;
+
+	do {
+		seqcount = read_seqcount_begin(&mem->idle_page_stats_lock);
+		stats = mem->idle_page_stats;
+		scans = mem->idle_page_scans;
+	} while (read_seqcount_retry(&mem->idle_page_stats_lock, seqcount));
+
+	cb->fill(cb, "idle_clean", stats.idle_clean * PAGE_SIZE);
+	cb->fill(cb, "idle_dirty_file", stats.idle_dirty_file * PAGE_SIZE);
+	cb->fill(cb, "idle_dirty_swap", stats.idle_dirty_swap * PAGE_SIZE);
+	cb->fill(cb, "scans", scans);
+
+	return 0;
+}
+#endif /* CONFIG_KSTALED */
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4738,6 +4774,12 @@ static struct cftype mem_cgroup_files[] = {
 		.mode = S_IRUGO,
 	},
 #endif
+#ifdef CONFIG_KSTALED
+	{
+		.name = "idle_page_stats",
+		.read_map = mem_cgroup_idle_page_stats_read,
+	},
+#endif
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -5001,6 +5043,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
+#ifdef CONFIG_KSTALED
+	seqcount_init(&mem->idle_page_stats_lock);
+#endif
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
@@ -5568,3 +5613,249 @@ static int __init enable_swap_account(char *s)
 __setup("swapaccount=", enable_swap_account);
 
 #endif
+
+#ifdef CONFIG_KSTALED
+
+static unsigned int kstaled_scan_seconds;
+static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
+
+static inline void kstaled_scan_page(struct page *page)
+{
+	bool is_locked = false;
+	bool is_file;
+	struct pr_info info;
+	struct page_cgroup *pc;
+	struct idle_page_stats *stats;
+
+	/*
+	 * Before taking the page reference, check if the page is
+	 * a user page which is not obviously unreclaimable
+	 * (we will do more complete checks later).
+	 */
+	if (!PageLRU(page) || PageMlocked(page) ||
+	    (page->mapping == NULL && !PageSwapCache(page)))
+		return;
+
+	if (!get_page_unless_zero(page))
+		return;
+
+	/* Recheck now that we have the page reference. */
+	if (unlikely(!PageLRU(page) || PageMlocked(page)))
+		goto out;
+
+	/*
+	 * Anon and SwapCache pages can be identified without locking.
+	 * For all other cases, we need the page locked in order to
+	 * dereference page->mapping.
+	 */
+	if (PageAnon(page) || PageSwapCache(page))
+		is_file = false;
+	else if (!trylock_page(page)) {
+		/*
+		 * We need to lock the page to dereference the mapping.
+		 * But don't risk sleeping by calling lock_page().
+		 * We don't want to stall kstaled, so we conservatively
+		 * count locked pages as unreclaimable.
+		 */
+		goto out;
+	} else {
+		struct address_space *mapping = page->mapping;
+
+		is_locked = true;
+
+		/*
+		 * The page is still not anon - we have held a reference
+		 * since the prior check, so it could not have been reused.
+		 */
+		VM_BUG_ON(PageAnon(page) || mapping != page_rmapping(page));
+
+		/*
+		 * Check the mapping under protection of the page lock.
+		 * 1. If the page is not swap cache and has no mapping,
+		 *    shrink_page_list can't do anything with it.
+		 * 2. If the mapping is unevictable (as in SHM_LOCK segments),
+		 *    shrink_page_list can't do anything with it.
+		 * 3. If the page is swap cache or the mapping is swap backed
+		 *    (as in shmem), consider it a swappable page.
+		 * 4. If the backing dev has indicated that it does not want
+		 *    its pages sync'd to disk (as in ramfs), take this as
+		 *    a hint that its pages are not reclaimable.
+		 * 5. Otherwise, consider this as a file page reclaimable
+		 *    through standard pageout.
+		 */
+		if (!mapping && !PageSwapCache(page))
+			goto out;
+		else if (mapping_unevictable(mapping))
+			goto out;
+		else if (PageSwapCache(page) ||
+			 mapping_cap_swap_backed(mapping))
+			is_file = false;
+		else if (!mapping_cap_writeback_dirty(mapping))
+			goto out;
+		else
+			is_file = true;
+	}
+
+	/* Find out if the page is idle. Also test for pending mlock. */
+	page_referenced_kstaled(page, is_locked, &info);
+	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED))
+		goto out;
+
+	/* Locate kstaled stats for the page's cgroup. */
+	pc = lookup_page_cgroup(page);
+	if (!pc)
+		goto out;
+	lock_page_cgroup(pc);
+	if (!PageCgroupUsed(pc))
+		goto unlock_page_cgroup_out;
+	stats = &pc->mem_cgroup->idle_scan_stats;
+
+	/* Finally increment the correct statistic for this page. */
+	if (!(info.pr_flags & PR_DIRTY) &&
+	    !PageDirty(page) && !PageWriteback(page))
+		stats->idle_clean++;
+	else if (is_file)
+		stats->idle_dirty_file++;
+	else
+		stats->idle_dirty_swap++;
+
+ unlock_page_cgroup_out:
+	unlock_page_cgroup(pc);
+
+ out:
+	if (is_locked)
+		unlock_page(page);
+	put_page(page);
+}
+
+static void kstaled_scan_node(pg_data_t *pgdat)
+{
+	unsigned long flags;
+	unsigned long start, end, pfn;
+
+	pgdat_resize_lock(pgdat, &flags);
+
+	start = pgdat->node_start_pfn;
+	end = start + pgdat->node_spanned_pages;
+
+	for (pfn = start; pfn < end; pfn++) {
+		if (need_resched()) {
+			pgdat_resize_unlock(pgdat, &flags);
+			cond_resched();
+			pgdat_resize_lock(pgdat, &flags);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+			/* abort if the node got resized */
+			if (pfn < pgdat->node_start_pfn ||
+			    end > (pgdat->node_start_pfn +
+				   pgdat->node_spanned_pages))
+				goto abort;
+#endif
+		}
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		kstaled_scan_page(pfn_to_page(pfn));
+	}
+
+abort:
+	pgdat_resize_unlock(pgdat, &flags);
+}
+
+static int kstaled(void *dummy)
+{
+	while (1) {
+		int scan_seconds;
+		int nid;
+		struct mem_cgroup *mem;
+
+		wait_event_interruptible(kstaled_wait,
+				 (scan_seconds = kstaled_scan_seconds) > 0);
+		/*
+		 * We use interruptible wait_event so as not to contribute
+		 * to the machine load average while we're sleeping.
+		 * However, we don't actually expect to receive a signal
+		 * since we run as a kernel thread, so the condition we were
+		 * waiting for should be true once we get here.
+		 */
+		BUG_ON(scan_seconds <= 0);
+
+		for_each_mem_cgroup_all(mem)
+			memset(&mem->idle_scan_stats, 0,
+			       sizeof(mem->idle_scan_stats));
+
+		for_each_node_state(nid, N_HIGH_MEMORY)
+			kstaled_scan_node(NODE_DATA(nid));
+
+		for_each_mem_cgroup_all(mem) {
+			write_seqcount_begin(&mem->idle_page_stats_lock);
+			mem->idle_page_stats = mem->idle_scan_stats;
+			mem->idle_page_scans++;
+			write_seqcount_end(&mem->idle_page_stats_lock);
+		}
+
+		schedule_timeout_interruptible(scan_seconds * HZ);
+	}
+
+	BUG();
+	return 0;	/* NOT REACHED */
+}
+
+static ssize_t kstaled_scan_seconds_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sprintf(buf, "%u\n", kstaled_scan_seconds);
+}
+
+static ssize_t kstaled_scan_seconds_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	int err;
+	unsigned long input;
+
+	err = strict_strtoul(buf, 10, &input);
+	if (err)
+		return -EINVAL;
+	kstaled_scan_seconds = input;
+	wake_up_interruptible(&kstaled_wait);
+	return count;
+}
+
+static struct kobj_attribute kstaled_scan_seconds_attr = __ATTR(
+	scan_seconds, 0644,
+	kstaled_scan_seconds_show, kstaled_scan_seconds_store);
+
+static struct attribute *kstaled_attrs[] = {
+	&kstaled_scan_seconds_attr.attr,
+	NULL
+};
+static struct attribute_group kstaled_attr_group = {
+	.name = "kstaled",
+	.attrs = kstaled_attrs,
+};
+
+static int __init kstaled_init(void)
+{
+	int error;
+	struct task_struct *thread;
+
+	error = sysfs_create_group(mm_kobj, &kstaled_attr_group);
+	if (error) {
+		pr_err("Failed to create kstaled sysfs node\n");
+		return error;
+	}
+
+	thread = kthread_run(kstaled, NULL, "kstaled");
+	if (IS_ERR(thread)) {
+		pr_err("Failed to start kstaled\n");
+		return PTR_ERR(thread);
+	}
+
+	return 0;
+}
+module_init(kstaled_init);
+
+#endif /* CONFIG_KSTALED */
-- 
1.7.3.1



* [PATCH 5/8] kstaled: skip non-RAM regions.
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-17  3:39   ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  3:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Dave Hansen
  Cc: Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

Add a pfn_skip_hole() function that shrinks the passed input range in order
to skip over pfn ranges that are known not to be RAM backed. The x86
implementation achieves this using the e820 tables; other architectures
fall back to a generic no-op implementation.
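
For reference, a sketch of the intended calling pattern; it mirrors the
updated kstaled_scan_node() loop in the mm/memcontrol.c hunk below.
scan_one_page() is a hypothetical placeholder and all locking is omitted.

    /* Kernel-context sketch: walk [pfn, end) one RAM backed chunk at a time. */
    static void scan_pfn_range(unsigned long pfn, unsigned long end)
    {
        while (pfn < end) {
            unsigned long chunk = end;

            /* Shrink pfn..chunk to the next RAM backed stretch; when no RAM
             * remains, pfn is advanced to chunk and the loop terminates. */
            pfn_skip_hole(&pfn, &chunk);

            for (; pfn < chunk; pfn++) {
                if (!pfn_valid(pfn))
                    continue;
                scan_one_page(pfn_to_page(pfn));
            }
        }
    }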


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/include/asm/page_types.h |    8 ++++++
 arch/x86/kernel/e820.c            |   45 +++++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h            |    6 +++++
 mm/memcontrol.c                   |   41 +++++++++++++++++++--------------
 4 files changed, 83 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index bce688d..b0676c2 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -57,6 +57,14 @@ extern unsigned long init_memory_mapping(unsigned long start,
 extern void initmem_init(void);
 extern void free_initmem(void);
 
+extern void e820_skip_hole(unsigned long *start_pfn, unsigned long *end_pfn);
+
+#define ARCH_HAVE_PFN_SKIP_HOLE 1
+static inline void pfn_skip_hole(unsigned long *start, unsigned long *end)
+{
+	e820_skip_hole(start, end);
+}
+
 #endif	/* !__ASSEMBLY__ */
 
 #endif	/* _ASM_X86_PAGE_DEFS_H */
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 3e2ef84..0677873 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1123,3 +1123,48 @@ void __init memblock_find_dma_reserve(void)
 	set_dma_reserve(mem_size_pfn - free_size_pfn);
 #endif
 }
+
+/*
+ * The caller wants to skip pfns that are guaranteed to not be valid
+ * memory. Find a stretch of ram between [start_pfn, end_pfn) and
+ * return its pfn range back through start_pfn and end_pfn.
+ */
+
+void e820_skip_hole(unsigned long *start_pfn, unsigned long *end_pfn)
+{
+	unsigned long start = *start_pfn << PAGE_SHIFT;
+	unsigned long end = *end_pfn << PAGE_SHIFT;
+	int i;
+
+	if (start >= end)
+		goto fail;		/* short-circuit e820 checks */
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		unsigned long last, addr;
+
+		addr = round_up(ei->addr, PAGE_SIZE);
+		last = round_down(ei->addr + ei->size, PAGE_SIZE);
+
+		if (addr >= end)
+			goto fail;	/* We're done, not found */
+		if (last <= start)
+			continue;	/* Not at start yet, move on */
+		if (ei->type != E820_RAM)
+			continue;	/* Not RAM, move on */
+
+		/*
+		 * We've found RAM. If start is in this e820 range, return
+		 * it, otherwise return the start of this e820 range.
+		 */
+
+		if (addr > start)
+			*start_pfn = addr >> PAGE_SHIFT;
+		if (last < end)
+			*end_pfn = last >> PAGE_SHIFT;
+		return;
+	}
+fail:
+	*start_pfn = *end_pfn;
+	return;				/* No luck, return failure */
+}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f7c3eb..6657106 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -930,6 +930,12 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
 #define pfn_to_nid(pfn)		(0)
 #endif
 
+#ifndef ARCH_HAVE_PFN_SKIP_HOLE
+static inline void pfn_skip_hole(unsigned long *start, unsigned long *end)
+{
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aebd45a..0fdc278 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5731,32 +5731,39 @@ static inline void kstaled_scan_page(struct page *page)
 static void kstaled_scan_node(pg_data_t *pgdat)
 {
 	unsigned long flags;
-	unsigned long start, end, pfn;
+	unsigned long pfn, end;
 
 	pgdat_resize_lock(pgdat, &flags);
 
-	start = pgdat->node_start_pfn;
-	end = start + pgdat->node_spanned_pages;
+	pfn = pgdat->node_start_pfn;
+	end = pfn + pgdat->node_spanned_pages;
 
-	for (pfn = start; pfn < end; pfn++) {
-		if (need_resched()) {
-			pgdat_resize_unlock(pgdat, &flags);
-			cond_resched();
-			pgdat_resize_lock(pgdat, &flags);
+	while (pfn < end) {
+		unsigned long contiguous = end;
+
+		/* restrict pfn..contiguous to be a RAM backed range */
+		pfn_skip_hole(&pfn, &contiguous);
+
+		for (; pfn < contiguous; pfn++) {
+			if (need_resched()) {
+				pgdat_resize_unlock(pgdat, &flags);
+				cond_resched();
+				pgdat_resize_lock(pgdat, &flags);
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-			/* abort if the node got resized */
-			if (pfn < pgdat->node_start_pfn ||
-			    end > (pgdat->node_start_pfn +
-				   pgdat->node_spanned_pages))
-				goto abort;
+				/* abort if the node got resized */
+				if (pfn < pgdat->node_start_pfn ||
+				    end > (pgdat->node_start_pfn +
+					   pgdat->node_spanned_pages))
+					goto abort;
 #endif
-		}
+			}
 
-		if (!pfn_valid(pfn))
-			continue;
+			if (!pfn_valid(pfn))
+				continue;
 
-		kstaled_scan_page(pfn_to_page(pfn));
+			kstaled_scan_page(pfn_to_page(pfn));
+		}
 	}
 
 abort:
-- 
1.7.3.1



* [PATCH 6/8] kstaled: rate limit pages scanned per second.
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-17  3:39   ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  3:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Dave Hansen
  Cc: Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

Scan a bounded number of pages from each node every second, instead of
trying to scan the entire memory at once and then sitting idle for the rest
of the configured interval.
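
As a worked example of the resulting scan rate (the numbers are illustrative
only, not taken from the patch): with DIV_ROUND_UP(node_spanned_pages,
scan_seconds) pages scanned per one-second wakeup,

    /* Illustrative arithmetic only; the values are hypothetical. */
    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

    static unsigned long pages_per_second(void)
    {
        unsigned long node_spanned_pages = 1048576; /* 4 GiB of 4 KiB pages */
        unsigned int scan_seconds = 120;

        /* 8739 pages per one-second wakeup; 120 such wakeups cover the node,
         * so a full pass completes within one scan interval. */
        return DIV_ROUND_UP(node_spanned_pages, scan_seconds);
    }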


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mmzone.h |    3 ++
 mm/memcontrol.c        |   85 +++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 72 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6657106..272fbed 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -631,6 +631,9 @@ typedef struct pglist_data {
 	unsigned long node_present_pages; /* total number of physical pages */
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
+#ifdef CONFIG_KSTALED
+	unsigned long node_idle_scan_pfn;
+#endif
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0fdc278..4a76fdcf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5617,6 +5617,7 @@ __setup("swapaccount=", enable_swap_account);
 #ifdef CONFIG_KSTALED
 
 static unsigned int kstaled_scan_seconds;
+static DEFINE_SPINLOCK(kstaled_scan_seconds_lock);
 static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
 
 static inline void kstaled_scan_page(struct page *page)
@@ -5728,15 +5729,19 @@ static inline void kstaled_scan_page(struct page *page)
 	put_page(page);
 }
 
-static void kstaled_scan_node(pg_data_t *pgdat)
+static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
 {
 	unsigned long flags;
-	unsigned long pfn, end;
+	unsigned long pfn, end, node_end;
 
 	pgdat_resize_lock(pgdat, &flags);
 
 	pfn = pgdat->node_start_pfn;
-	end = pfn + pgdat->node_spanned_pages;
+	node_end = pfn + pgdat->node_spanned_pages;
+	if (!reset && pfn < pgdat->node_idle_scan_pfn)
+		pfn = pgdat->node_idle_scan_pfn;
+	end = min(pfn + DIV_ROUND_UP(pgdat->node_spanned_pages, scan_seconds),
+		  node_end);
 
 	while (pfn < end) {
 		unsigned long contiguous = end;
@@ -5753,8 +5758,8 @@ static void kstaled_scan_node(pg_data_t *pgdat)
 #ifdef CONFIG_MEMORY_HOTPLUG
 				/* abort if the node got resized */
 				if (pfn < pgdat->node_start_pfn ||
-				    end > (pgdat->node_start_pfn +
-					   pgdat->node_spanned_pages))
+				    node_end > (pgdat->node_start_pfn +
+						pgdat->node_spanned_pages))
 					goto abort;
 #endif
 			}
@@ -5768,14 +5773,21 @@ static void kstaled_scan_node(pg_data_t *pgdat)
 
 abort:
 	pgdat_resize_unlock(pgdat, &flags);
+
+	pgdat->node_idle_scan_pfn = pfn;
+	return pfn >= node_end;
 }
 
 static int kstaled(void *dummy)
 {
+	int delayed = 0;
+	bool reset = true;
+
 	while (1) {
 		int scan_seconds;
 		int nid;
-		struct mem_cgroup *mem;
+		long earlier, delta;
+		bool scan_done;
 
 		wait_event_interruptible(kstaled_wait,
 				 (scan_seconds = kstaled_scan_seconds) > 0);
@@ -5788,21 +5800,60 @@ static int kstaled(void *dummy)
 		 */
 		BUG_ON(scan_seconds <= 0);
 
-		for_each_mem_cgroup_all(mem)
-			memset(&mem->idle_scan_stats, 0,
-			       sizeof(mem->idle_scan_stats));
+		earlier = jiffies;
 
+		scan_done = true;
 		for_each_node_state(nid, N_HIGH_MEMORY)
-			kstaled_scan_node(NODE_DATA(nid));
+			scan_done &= kstaled_scan_node(NODE_DATA(nid),
+						       scan_seconds, reset);
+
+		if (scan_done) {
+			struct mem_cgroup *mem;
+
+			for_each_mem_cgroup_all(mem) {
+				write_seqcount_begin(&mem->idle_page_stats_lock);
+				mem->idle_page_stats = mem->idle_scan_stats;
+				mem->idle_page_scans++;
+				write_seqcount_end(&mem->idle_page_stats_lock);
+				memset(&mem->idle_scan_stats, 0,
+				       sizeof(mem->idle_scan_stats));
+			}
+		}
 
-		for_each_mem_cgroup_all(mem) {
-			write_seqcount_begin(&mem->idle_page_stats_lock);
-			mem->idle_page_stats = mem->idle_scan_stats;
-			mem->idle_page_scans++;
-			write_seqcount_end(&mem->idle_page_stats_lock);
+		delta = jiffies - earlier;
+		if (delta < HZ / 2) {
+			delayed = 0;
+			schedule_timeout_interruptible(HZ - delta);
+		} else {
+			/*
+			 * Emergency throttle if we're taking too long.
+			 * We are supposed to scan an entire slice in 1 second.
+			 * If we keep taking longer for 10 consecutive times,
+			 * double scan_seconds to slow the scan down.
+			 *
+			 * If someone changed kstaled_scan_seconds while we
+			 * were running, hope they know what they're doing and
+			 * assume they've eliminated any delays.
+			 */
+			bool updated = false;
+			spin_lock(&kstaled_scan_seconds_lock);
+			if (scan_seconds != kstaled_scan_seconds)
+				delayed = 0;
+			else if (++delayed == 10) {
+				delayed = 0;
+				scan_seconds *= 2;
+				kstaled_scan_seconds = scan_seconds;
+				updated = true;
+			}
+			spin_unlock(&kstaled_scan_seconds_lock);
+			if (updated)
+				pr_warning("kstaled taking too long, "
+					   "scan_seconds now %d\n",
+					   scan_seconds);
+			schedule_timeout_interruptible(HZ / 2);
 		}
 
-		schedule_timeout_interruptible(scan_seconds * HZ);
+		reset = scan_done;
 	}
 
 	BUG();
@@ -5826,7 +5877,9 @@ static ssize_t kstaled_scan_seconds_store(struct kobject *kobj,
 	err = strict_strtoul(buf, 10, &input);
 	if (err)
 		return -EINVAL;
+	spin_lock(&kstaled_scan_seconds_lock);
 	kstaled_scan_seconds = input;
+	spin_unlock(&kstaled_scan_seconds_lock);
 	wake_up_interruptible(&kstaled_wait);
 	return count;
 }
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 7/8] kstaled: add histogram sampling functionality
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-17  3:39   ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  3:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Dave Hansen
  Cc: Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

Add statistics for pages that have been idle for 1, 2, 5, 15, 30, 60, 120 or
240 consecutive scan intervals to /dev/cgroup/*/memory.idle_page_stats.


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mmzone.h |    2 +
 mm/memcontrol.c        |  103 +++++++++++++++++++++++++++++++++++++++---------
 mm/memory_hotplug.c    |    6 +++
 3 files changed, 92 insertions(+), 19 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 272fbed..d8eca1b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -633,6 +633,8 @@ typedef struct pglist_data {
 					     range, including holes */
 #ifdef CONFIG_KSTALED
 	unsigned long node_idle_scan_pfn;
+	u8 *node_idle_page_age;           /* number of scan intervals since
+					     each page was referenced */
 #endif
 	int node_id;
 	wait_queue_head_t kswapd_wait;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4a76fdcf..ef406a1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -207,6 +207,11 @@ struct mem_cgroup_eventfd_list {
 static void mem_cgroup_threshold(struct mem_cgroup *mem);
 static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
 
+#ifdef CONFIG_KSTALED
+static const int kstaled_buckets[] = {1, 2, 5, 15, 30, 60, 120, 240};
+#define NUM_KSTALED_BUCKETS ARRAY_SIZE(kstaled_buckets)
+#endif
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -292,7 +297,8 @@ struct mem_cgroup {
 		unsigned long idle_clean;
 		unsigned long idle_dirty_file;
 		unsigned long idle_dirty_swap;
-	} idle_page_stats, idle_scan_stats;
+	} idle_page_stats[NUM_KSTALED_BUCKETS],
+	  idle_scan_stats[NUM_KSTALED_BUCKETS];
 	unsigned long idle_page_scans;
 #endif
 };
@@ -4686,18 +4692,29 @@ static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
 	unsigned int seqcount;
-	struct idle_page_stats stats;
+	struct idle_page_stats stats[NUM_KSTALED_BUCKETS];
 	unsigned long scans;
+	int bucket;
 
 	do {
 		seqcount = read_seqcount_begin(&mem->idle_page_stats_lock);
-		stats = mem->idle_page_stats;
+		memcpy(stats, mem->idle_page_stats, sizeof(stats));
 		scans = mem->idle_page_scans;
 	} while (read_seqcount_retry(&mem->idle_page_stats_lock, seqcount));
 
-	cb->fill(cb, "idle_clean", stats.idle_clean * PAGE_SIZE);
-	cb->fill(cb, "idle_dirty_file", stats.idle_dirty_file * PAGE_SIZE);
-	cb->fill(cb, "idle_dirty_swap", stats.idle_dirty_swap * PAGE_SIZE);
+	for (bucket = 0; bucket < NUM_KSTALED_BUCKETS; bucket++) {
+		char basename[32], name[32];
+		if (!bucket)
+			sprintf(basename, "idle");
+		else
+			sprintf(basename, "idle_%d", kstaled_buckets[bucket]);
+		sprintf(name, "%s_clean", basename);
+		cb->fill(cb, name, stats[bucket].idle_clean * PAGE_SIZE);
+		sprintf(name, "%s_dirty_file", basename);
+		cb->fill(cb, name, stats[bucket].idle_dirty_file * PAGE_SIZE);
+		sprintf(name, "%s_dirty_swap", basename);
+		cb->fill(cb, name, stats[bucket].idle_dirty_swap * PAGE_SIZE);
+	}
 	cb->fill(cb, "scans", scans);
 
 	return 0;
@@ -5620,12 +5637,25 @@ static unsigned int kstaled_scan_seconds;
 static DEFINE_SPINLOCK(kstaled_scan_seconds_lock);
 static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
 
-static inline void kstaled_scan_page(struct page *page)
+static inline struct idle_page_stats *
+kstaled_idle_stats(struct mem_cgroup *mem, int age)
+{
+	int bucket = 0;
+
+	while (age >= kstaled_buckets[bucket + 1])
+		if (++bucket == NUM_KSTALED_BUCKETS - 1)
+			break;
+	return mem->idle_scan_stats + bucket;
+}
+
+static inline void kstaled_scan_page(struct page *page, u8 *idle_page_age)
 {
 	bool is_locked = false;
 	bool is_file;
 	struct pr_info info;
 	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+	int age;
 	struct idle_page_stats *stats;
 
 	/*
@@ -5699,17 +5729,25 @@ static inline void kstaled_scan_page(struct page *page)
 
 	/* Find out if the page is idle. Also test for pending mlock. */
 	page_referenced_kstaled(page, is_locked, &info);
-	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED))
+	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED)) {
+		*idle_page_age = 0;
 		goto out;
+	}
 
 	/* Locate kstaled stats for the page's cgroup. */
 	pc = lookup_page_cgroup(page);
 	if (!pc)
 		goto out;
 	lock_page_cgroup(pc);
+	mem = pc->mem_cgroup;
 	if (!PageCgroupUsed(pc))
 		goto unlock_page_cgroup_out;
-	stats = &pc->mem_cgroup->idle_scan_stats;
+
+	/* Page is idle, increment its age and get the right stats bucket */
+	age = *idle_page_age;
+	if (age < 255)
+		*idle_page_age = ++age;
+	stats = kstaled_idle_stats(mem, age);
 
 	/* Finally increment the correct statistic for this page. */
 	if (!(info.pr_flags & PR_DIRTY) &&
@@ -5733,11 +5771,22 @@ static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
 {
 	unsigned long flags;
 	unsigned long pfn, end, node_end;
+	u8 *idle_page_age;
 
 	pgdat_resize_lock(pgdat, &flags);
 
+	if (!pgdat->node_idle_page_age) {
+		pgdat->node_idle_page_age = vmalloc(pgdat->node_spanned_pages);
+		if (!pgdat->node_idle_page_age) {
+			pgdat_resize_unlock(pgdat, &flags);
+			return false;
+		}
+		memset(pgdat->node_idle_page_age, 0, pgdat->node_spanned_pages);
+	}
+
 	pfn = pgdat->node_start_pfn;
 	node_end = pfn + pgdat->node_spanned_pages;
+	idle_page_age = pgdat->node_idle_page_age - pfn;
 	if (!reset && pfn < pgdat->node_idle_scan_pfn)
 		pfn = pgdat->node_idle_scan_pfn;
 	end = min(pfn + DIV_ROUND_UP(pgdat->node_spanned_pages, scan_seconds),
@@ -5759,7 +5808,8 @@ static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
 				/* abort if the node got resized */
 				if (pfn < pgdat->node_start_pfn ||
 				    node_end > (pgdat->node_start_pfn +
-						pgdat->node_spanned_pages))
+						pgdat->node_spanned_pages) ||
+				    !pgdat->node_idle_page_age)
 					goto abort;
 #endif
 			}
@@ -5767,7 +5817,8 @@ static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
 			if (!pfn_valid(pfn))
 				continue;
 
-			kstaled_scan_page(pfn_to_page(pfn));
+			kstaled_scan_page(pfn_to_page(pfn),
+					  idle_page_age + pfn);
 		}
 	}
 
@@ -5778,6 +5829,26 @@ abort:
 	return pfn >= node_end;
 }
 
+static void kstaled_update_stats(struct mem_cgroup *mem)
+{
+	struct idle_page_stats tot;
+	int i;
+
+	memset(&tot, 0, sizeof(tot));
+
+	write_seqcount_begin(&mem->idle_page_stats_lock);
+	for (i = NUM_KSTALED_BUCKETS - 1; i >= 0; i--) {
+		tot.idle_clean      += mem->idle_scan_stats[i].idle_clean;
+		tot.idle_dirty_file += mem->idle_scan_stats[i].idle_dirty_file;
+		tot.idle_dirty_swap += mem->idle_scan_stats[i].idle_dirty_swap;
+		mem->idle_page_stats[i] = tot;
+	}
+	mem->idle_page_scans++;
+	write_seqcount_end(&mem->idle_page_stats_lock);
+
+	memset(&mem->idle_scan_stats, 0, sizeof(mem->idle_scan_stats));
+}
+
 static int kstaled(void *dummy)
 {
 	int delayed = 0;
@@ -5810,14 +5881,8 @@ static int kstaled(void *dummy)
 		if (scan_done) {
 			struct mem_cgroup *mem;
 
-			for_each_mem_cgroup_all(mem) {
-				write_seqcount_begin(&mem->idle_page_stats_lock);
-				mem->idle_page_stats = mem->idle_scan_stats;
-				mem->idle_page_scans++;
-				write_seqcount_end(&mem->idle_page_stats_lock);
-				memset(&mem->idle_scan_stats, 0,
-				       sizeof(mem->idle_scan_stats));
-			}
+			for_each_mem_cgroup_all(mem)
+				kstaled_update_stats(mem);
 		}
 
 		delta = jiffies - earlier;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c46887b..0b490ac 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -211,6 +211,12 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 
 	pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
 					pgdat->node_start_pfn;
+#ifdef CONFIG_KSTALED
+	if (pgdat->node_idle_page_age) {
+		vfree(pgdat->node_idle_page_age);
+		pgdat->node_idle_page_age = NULL;
+	}
+#endif
 }
 
 static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread
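
Two details of this histogram are easy to miss in diff form:
kstaled_idle_stats() maps a page's age to the highest bucket whose
threshold it has reached, and kstaled_update_stats() sums from the
oldest bucket down, so the exported idle_N_* counters are cumulative
(pages idle for at least N intervals). A small sketch of the bucket
lookup, reimplemented in plain C as an illustration rather than lifted
from the kernel:

#include <stdio.h>

static const int buckets[] = {1, 2, 5, 15, 30, 60, 120, 240};
#define NUM_BUCKETS ((int)(sizeof(buckets) / sizeof(buckets[0])))

/* An age of N scan intervals lands in the last bucket whose threshold
 * is <= N; ages beyond 240 all land in the final bucket. */
static int age_to_bucket(int age)
{
	int bucket = 0;

	while (bucket < NUM_BUCKETS - 1 && age >= buckets[bucket + 1])
		bucket++;
	return bucket;
}

int main(void)
{
	int ages[] = {1, 2, 4, 5, 200, 240, 255};

	for (int i = 0; i < 7; i++) {
		int b = age_to_bucket(ages[i]);
		printf("age %3d -> bucket %d (idle >= %d intervals)\n",
		       ages[i], b, buckets[b]);
	}
	return 0;
}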

* [PATCH 8/8] kstaled: add incrementally updating stale page count
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-17  3:39   ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  3:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Dave Hansen
  Cc: Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

Add an incrementally updating stale page count. A new per-cgroup
memory.stale_page_age file is introduced. After a non-zero number of scan
cycles is written there, pages that have been idle for at least that number
of cycles and are currently clean are reported in memory.idle_page_stats
as being stale. Unlike the idle_*_clean statistics, this stale page
count is continually updated - hooks have been added to notice pages being
accessed or rendered unevictable, at which point the stale page count for
that cgroup is instantly decremented. The point is to allow userspace to
quickly respond to increased memory pressure.


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/page-flags.h |   15 ++++++++
 include/linux/pagemap.h    |   11 ++++--
 mm/internal.h              |    1 +
 mm/memcontrol.c            |   82 ++++++++++++++++++++++++++++++++++++++++++--
 mm/mlock.c                 |    1 +
 mm/vmscan.c                |    2 +-
 6 files changed, 104 insertions(+), 8 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e964d98..22dbe90 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -58,6 +58,8 @@
  *
  * PG_idle indicates that the page has not been referenced since the last time
  * kstaled scanned it.
+ *
+ * PG_stale indicates that the page is currently counted as stale.
  */
 
 /*
@@ -117,6 +119,7 @@ enum pageflags {
 #ifdef CONFIG_KSTALED
 	PG_young,		/* kstaled cleared pte_young */
 	PG_idle,		/* idle since start of kstaled interval */
+	PG_stale,		/* page is counted as stale */
 #endif
 	__NR_PAGEFLAGS,
 
@@ -293,21 +296,33 @@ PAGEFLAG_FALSE(HWPoison)
 
 PAGEFLAG(Young, young)
 PAGEFLAG(Idle, idle)
+PAGEFLAG(Stale, stale) TESTSCFLAG(Stale, stale)
+
+void __set_page_nonstale(struct page *page);
+
+static inline void set_page_nonstale(struct page *page)
+{
+	if (PageStale(page))
+		__set_page_nonstale(page);
+}
 
 static inline void set_page_young(struct page *page)
 {
+	set_page_nonstale(page);
 	if (!PageYoung(page))
 		SetPageYoung(page);
 }
 
 static inline void clear_page_idle(struct page *page)
 {
+	set_page_nonstale(page);
 	if (PageIdle(page))
 		ClearPageIdle(page);
 }
 
 #else /* !CONFIG_KSTALED */
 
+static inline void set_page_nonstale(struct page *page) {}
 static inline void set_page_young(struct page *page) {}
 static inline void clear_page_idle(struct page *page) {}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716875e..693dd20 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -46,11 +46,14 @@ static inline void mapping_clear_unevictable(struct address_space *mapping)
 	clear_bit(AS_UNEVICTABLE, &mapping->flags);
 }
 
-static inline int mapping_unevictable(struct address_space *mapping)
+static inline int mapping_unevictable(struct address_space *mapping,
+				      struct page *page)
 {
-	if (mapping)
-		return test_bit(AS_UNEVICTABLE, &mapping->flags);
-	return !!mapping;
+	if (mapping && test_bit(AS_UNEVICTABLE, &mapping->flags)) {
+		set_page_nonstale(page);
+		return 1;
+	}
+	return 0;
 }
 
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
diff --git a/mm/internal.h b/mm/internal.h
index d071d38..d1cb0d6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -93,6 +93,7 @@ static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
 		return 0;
 
 	if (!TestSetPageMlocked(page)) {
+		set_page_nonstale(page);
 		inc_zone_page_state(page, NR_MLOCK);
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ef406a1..da21830 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -292,6 +292,8 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 
 #ifdef CONFIG_KSTALED
+	int stale_page_age;
+
 	seqcount_t idle_page_stats_lock;
 	struct idle_page_stats {
 		unsigned long idle_clean;
@@ -299,6 +301,7 @@ struct mem_cgroup {
 		unsigned long idle_dirty_swap;
 	} idle_page_stats[NUM_KSTALED_BUCKETS],
 	  idle_scan_stats[NUM_KSTALED_BUCKETS];
+	atomic_long_t stale_pages;
 	unsigned long idle_page_scans;
 #endif
 };
@@ -2639,6 +2642,13 @@ static int mem_cgroup_move_account(struct page *page,
 		preempt_enable();
 	}
 	mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
+
+#ifdef CONFIG_KSTALED
+	/* Count page as non-stale */
+	if (PageStale(page) && TestClearPageStale(page))
+		atomic_long_dec(&from->stale_pages);
+#endif
+
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
 		__mem_cgroup_cancel_charge(from, nr_pages);
@@ -3067,6 +3077,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 
 	mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), -nr_pages);
 
+#ifdef CONFIG_KSTALED
+	/* Count page as non-stale */
+	if (PageStale(page) && TestClearPageStale(page))
+		atomic_long_dec(&mem->stale_pages);
+#endif
+
 	ClearPageCgroupUsed(pc);
 	/*
 	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
@@ -4716,6 +4732,29 @@ static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
 		cb->fill(cb, name, stats[bucket].idle_dirty_swap * PAGE_SIZE);
 	}
 	cb->fill(cb, "scans", scans);
+	cb->fill(cb, "stale",
+		 max(atomic_long_read(&mem->stale_pages), 0L) * PAGE_SIZE);
+
+	return 0;
+}
+
+static u64 mem_cgroup_stale_page_age_read(struct cgroup *cgrp,
+					  struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	return mem->stale_page_age;
+}
+
+static int mem_cgroup_stale_page_age_write(struct cgroup *cgrp,
+					   struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	if (val > 255)
+		return -EINVAL;
+
+	mem->stale_page_age = val;
 
 	return 0;
 }
@@ -4796,6 +4835,11 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "idle_page_stats",
 		.read_map = mem_cgroup_idle_page_stats_read,
 	},
+	{
+		.name = "stale_page_age",
+		.read_u64 = mem_cgroup_stale_page_age_read,
+		.write_u64 = mem_cgroup_stale_page_age_write,
+	},
 #endif
 };
 
@@ -5716,7 +5760,7 @@ static inline void kstaled_scan_page(struct page *page, u8 *idle_page_age)
 		 */
 		if (!mapping && !PageSwapCache(page))
 			goto out;
-		else if (mapping_unevictable(mapping))
+		else if (mapping_unevictable(mapping, page))
 			goto out;
 		else if (PageSwapCache(page) ||
 			 mapping_cap_swap_backed(mapping))
@@ -5751,13 +5795,23 @@ static inline void kstaled_scan_page(struct page *page, u8 *idle_page_age)
 
 	/* Finally increment the correct statistic for this page. */
 	if (!(info.pr_flags & PR_DIRTY) &&
-	    !PageDirty(page) && !PageWriteback(page))
+	    !PageDirty(page) && !PageWriteback(page)) {
 		stats->idle_clean++;
-	else if (is_file)
+
+		if (mem->stale_page_age && age >= mem->stale_page_age) {
+			if (!PageStale(page) && !TestSetPageStale(page))
+				atomic_long_inc(&mem->stale_pages);
+			goto unlock_page_cgroup_out;
+		}
+	} else if (is_file)
 		stats->idle_dirty_file++;
 	else
 		stats->idle_dirty_swap++;
 
+	/* Count page as non-stale */
+	if (PageStale(page) && TestClearPageStale(page))
+		atomic_long_dec(&mem->stale_pages);
+
  unlock_page_cgroup_out:
 	unlock_page_cgroup(pc);
 
@@ -5767,6 +5821,28 @@ static inline void kstaled_scan_page(struct page *page, u8 *idle_page_age)
 	put_page(page);
 }
 
+void __set_page_nonstale(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+
+	/* Locate kstaled stats for the page's cgroup. */
+	pc = lookup_page_cgroup(page);
+	if (!pc)
+		return;
+	lock_page_cgroup(pc);
+	mem = pc->mem_cgroup;
+	if (!PageCgroupUsed(pc))
+		goto out;
+
+	/* Count page as non-stale */
+	if (TestClearPageStale(page))
+		atomic_long_dec(&mem->stale_pages);
+
+out:
+	unlock_page_cgroup(pc);
+}
+
 static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
 {
 	unsigned long flags;
diff --git a/mm/mlock.c b/mm/mlock.c
index 048260c..eac4c32 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -81,6 +81,7 @@ void mlock_vma_page(struct page *page)
 	BUG_ON(!PageLocked(page));
 
 	if (!TestSetPageMlocked(page)) {
+		set_page_nonstale(page);
 		inc_zone_page_state(page, NR_MLOCK);
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 		if (!isolate_lru_page(page))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7bd9868..752fd21 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3203,7 +3203,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 int page_evictable(struct page *page, struct vm_area_struct *vma)
 {
 
-	if (mapping_unevictable(page_mapping(page)))
+	if (mapping_unevictable(page_mapping(page), page))
 		return 0;
 
 	if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread
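
As a usage illustration for the interface added above: userspace arms
the counter by writing a cycle count to memory.stale_page_age and then
polls the continuously updated "stale" field of memory.idle_page_stats.
A hedged sketch follows; the /dev/cgroup/mygroup path and the threshold
of 8 scan cycles are assumptions for the example, not part of the patch:

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *dir = "/dev/cgroup/mygroup";	/* hypothetical cgroup */
	char path[256], line[128];
	FILE *f;

	/* Count a clean page as stale once it has been idle for 8 scans. */
	snprintf(path, sizeof(path), "%s/memory.stale_page_age", dir);
	f = fopen(path, "w");
	if (!f || fprintf(f, "8\n") < 0)
		return 1;
	fclose(f);

	/* Unlike the idle_* counters, "stale" is updated between scans. */
	snprintf(path, sizeof(path), "%s/memory.idle_page_stats", dir);
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "stale ", 6))
			fputs(line, stdout);	/* bytes of stale memory */
	fclose(f);
	return 0;
}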

* Re: [PATCH 1/8] page_referenced: replace vm_flags parameter with struct pr_info
  2011-09-17  3:39   ` Michel Lespinasse
@ 2011-09-17  3:44     ` Joe Perches
  -1 siblings, 0 replies; 54+ messages in thread
From: Joe Perches @ 2011-09-17  3:44 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Peter Zijlstra, Michael Wolf

On Fri, 2011-09-16 at 20:39 -0700, Michel Lespinasse wrote:
> Introduce struct pr_info, passed into page_referenced() family of functions,

pr_info is a pretty commonly used function/macro.
Perhaps pageref_info instead?




^ permalink raw reply	[flat|nested] 54+ messages in thread
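
For context on the clash: pr_info() is the kernel's standard KERN_INFO
printk wrapper, so a struct with the same name is legal C but reads
ambiguously at call sites. A standalone illustration (not the kernel
definitions; the struct fields are abbreviated from how they are used
in this series):

#include <stdio.h>

/* Stand-in for the printk helper every kernel file can already see. */
#define pr_info(fmt, ...) printf("info: " fmt, ##__VA_ARGS__)

/* The struct from patch 1/8, reduced to the fields used in the series;
 * the two names live in different C namespaces, so this compiles, but
 * "struct pr_info info" next to pr_info(...) is easy to misread. */
struct pr_info {
	unsigned long vm_flags;
	unsigned long pr_flags;
};

int main(void)
{
	struct pr_info info = { .vm_flags = 0, .pr_flags = 0 };

	pr_info("scanned a page, pr_flags=%lu\n", info.pr_flags);
	return 0;
}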

* Re: [PATCH 1/8] page_referenced: replace vm_flags parameter with struct pr_info
  2011-09-17  3:44     ` Joe Perches
@ 2011-09-17  4:51       ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-17  4:51 UTC (permalink / raw)
  To: Joe Perches
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Peter Zijlstra, Michael Wolf

On Fri, Sep 16, 2011 at 8:44 PM, Joe Perches <joe@perches.com> wrote:
> On Fri, 2011-09-16 at 20:39 -0700, Michel Lespinasse wrote:
>> Introduce struct pr_info, passed into page_referenced() family of functions,
>
> pr_info is a pretty commonly used function/macro.
> Perhaps pageref_info instead?

Hmm, you're right. I can see how people could find this confusing.
I'll make sure to change the name before this gets accepted.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/8] page_referenced: replace vm_flags parameter with struct pr_info
  2011-09-17  3:39   ` Michel Lespinasse
@ 2011-09-20 19:05     ` Rik van Riel
  -1 siblings, 0 replies; 54+ messages in thread
From: Rik van Riel @ 2011-09-20 19:05 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

On 09/16/2011 11:39 PM, Michel Lespinasse wrote:
> Introduce struct pr_info, passed into page_referenced() family of functions,
> to represent information about the pte references that have been found for
> that page. Currently contains the vm_flags information as well as
> a PR_REFERENCED flag. The idea is to make it easy to extend the API
> with new flags.
>
>
> Signed-off-by: Michel Lespinasse<walken@google.com>

I have to agree with Joe's suggested name change.

Other than that, this patch looks good (will ack the next version).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/8] kstaled: page_referenced_kstaled() and supporting infrastructure.
  2011-09-17  3:39   ` Michel Lespinasse
@ 2011-09-20 19:36     ` Peter Zijlstra
  -1 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2011-09-20 19:36 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf, rostedt

On Fri, 2011-09-16 at 20:39 -0700, Michel Lespinasse wrote:
> +PAGEFLAG(Young, young)

We should probably do something like the below. I couldn't figure out a
way to make the regex emit multiple functions from one macro, so I picked
the simple PageFoo test.

I even added an Emacs variant, although I didn't test it.

---
 scripts/tags.sh |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/scripts/tags.sh b/scripts/tags.sh
index 75c5d24f1..b07797a 100755
--- a/scripts/tags.sh
+++ b/scripts/tags.sh
@@ -132,7 +132,8 @@ exuberant()
 	--regex-asm='/^ENTRY\(([^)]*)\).*/\1/'                  \
 	--regex-c='/^SYSCALL_DEFINE[[:digit:]]?\(([^,)]*).*/sys_\1/' \
 	--regex-c++='/^TRACE_EVENT\(([^,)]*).*/trace_\1/'		\
-	--regex-c++='/^DEFINE_EVENT\([^,)]*, *([^,)]*).*/trace_\1/'
+	--regex-c++='/^DEFINE_EVENT\([^,)]*, *([^,)]*).*/trace_\1/'	\
+	--regex-c++='/^PAGEFLAG\(([^,)]*).*/Page\1/'
 
 	all_kconfigs | xargs $1 -a                              \
 	--langdef=kconfig --language-force=kconfig              \
@@ -154,7 +155,8 @@ emacs()
 	--regex='/^ENTRY(\([^)]*\)).*/\1/'                      \
 	--regex='/^SYSCALL_DEFINE[0-9]?(\([^,)]*\).*/sys_\1/'   \
 	--regex='/^TRACE_EVENT(\([^,)]*\).*/trace_\1/'		\
-	--regex='/^DEFINE_EVENT([^,)]*, *\([^,)]*\).*/trace_\1/'
+	--regex='/^DEFINE_EVENT([^,)]*, *\([^,)]*\).*/trace_\1/'\
+	--regex='/^PAGEFLAG(\([^,)]*\).*/Page\1/'
 
 	all_kconfigs | xargs $1 -a                              \
 	--regex='/^[ \t]*\(\(menu\)*config\)[ \t]+\([a-zA-Z0-9_]+\)/\3/'


^ permalink raw reply related	[flat|nested] 54+ messages in thread
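
The reason a tags regex is needed at all is that PAGEFLAG() builds its
accessors by token pasting, so identifiers such as PageStale never
appear literally in the tree. A simplified standalone sketch of that
pattern, with struct page and the bit helpers stubbed out so it
compiles on its own (an approximation of include/linux/page-flags.h,
not a copy):

#include <stdio.h>

/* Minimal stand-ins so the macro below is self-contained. */
struct page { unsigned long flags; };
enum { PG_young, PG_idle, PG_stale };

#define test_bit(nr, p)		(!!(*(p) & (1UL << (nr))))
#define set_bit(nr, p)		(*(p) |= (1UL << (nr)))
#define clear_bit(nr, p)	(*(p) &= ~(1UL << (nr)))

/* One invocation pastes together PageFoo/SetPageFoo/ClearPageFoo,
 * which is why ctags never sees those names spelled out. */
#define PAGEFLAG(uname, lname)						\
static inline int Page##uname(struct page *page)			\
	{ return test_bit(PG_##lname, &page->flags); }			\
static inline void SetPage##uname(struct page *page)			\
	{ set_bit(PG_##lname, &page->flags); }				\
static inline void ClearPage##uname(struct page *page)			\
	{ clear_bit(PG_##lname, &page->flags); }

PAGEFLAG(Stale, stale)

int main(void)
{
	struct page p = { .flags = 0 };

	SetPageStale(&p);
	printf("PageStale=%d\n", PageStale(&p));	/* 1 */
	ClearPageStale(&p);
	printf("PageStale=%d\n", PageStale(&p));	/* 0 */
	return 0;
}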

* Re: [PATCH 2/8] kstaled: documentation and config option.
  2011-09-17  3:39   ` Michel Lespinasse
@ 2011-09-20 21:23     ` Rik van Riel
  -1 siblings, 0 replies; 54+ messages in thread
From: Rik van Riel @ 2011-09-20 21:23 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

On 09/16/2011 11:39 PM, Michel Lespinasse wrote:
> Extend memory cgroup documentation to describe the optional idle page
> tracking features, and add the corresponding configuration option.
>
>
> Signed-off-by: Michel Lespinasse<walken@google.com>

Acked-by: Rik van Riel <riel@redhat.com>

(I'm going through these in order, and am assuming the
docs match the code from the later patches in the series)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/8] page_referenced: replace vm_flags parameter with struct pr_info
  2011-09-20 19:05     ` Rik van Riel
@ 2011-09-21  2:51       ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-21  2:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

On Tue, Sep 20, 2011 at 12:05 PM, Rik van Riel <riel@redhat.com> wrote:
> I have to agree with Joe's suggested name change.
>
> Other than that, this patch looks good (will ack the next version).

Very sweet ! I'll make sure to send that out soon. I think it's
easiest if I wait for you to review the current patches first, though
? (I'll send an incremental diff along with the next patch series)

Thanks a lot for having a look.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/8] idle page tracking / working set estimation
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-22 23:13   ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2011-09-22 23:13 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Dave Hansen,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf, Andrew Morton

On Fri, 16 Sep 2011 20:39:05 -0700
Michel Lespinasse <walken@google.com> wrote:

> Please comment on the following patches (which are against the v3.0 kernel).
> We are using these to collect memory utilization statistics for each cgroup
> accross many machines, and optimize job placement accordingly.

Please consider updating /proc/kpageflags with the three new page
flags.  If "yes": update.  If "no": explain/justify.

Which prompts the obvious: the whole feature could have been mostly
implemented in userspace, using kpageflags.  Some additional kernel
support would presumably be needed, but I'm not sure how much.

If you haven't already done so, please sketch down what that
infrastructure would look like and have a think about which approach is
preferable?



What bugs me a bit about the proposal is its cgroups-centricity.  The
question "how much memory is my application really using" comes up
again and again.  It predates cgroups.  One way to answer that question
is to force a massive amount of swapout on the entire machine, then let
the system recover and take a look at your app's RSS two minutes later.
This is very lame.

It's a legitimate requirement, and the kstaled infrastructure puts a
lot of things in place to answer it well.  But as far as I can tell it
doesn't quite get over the line.  Then again, maybe it _does_ get
there: put the application into a memcg all of its own, just for
instrumentation purposes and then use kstaled to monitor it?

<later> OK, I'm surprised to discover that kstaled is doing a physical
scan and not a virtual one.  I assume it works, but I don't know why. 
But it makes the above requirement harder, methinks.



How does all this code get along with hugepages, btw?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/8] kstaled: minimalistic implementation.
  2011-09-17  3:39   ` Michel Lespinasse
@ 2011-09-22 23:14     ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2011-09-22 23:14 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Dave Hansen,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf, Andrew Morton

On Fri, 16 Sep 2011 20:39:09 -0700
Michel Lespinasse <walken@google.com> wrote:

> Introduce minimal kstaled implementation. The scan rate is controlled by
> /sys/kernel/mm/kstaled/scan_seconds and per-cgroup statistics are output
> into /dev/cgroup/*/memory.idle_page_stats.
> 
>
> ...
>
> @@ -4668,6 +4680,30 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
>  }
>  #endif /* CONFIG_NUMA */
>  
> +#ifdef CONFIG_KSTALED
> +static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
> +	struct cftype *cft,  struct cgroup_map_cb *cb)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);

nit: please prefer to use identifier "memcg" when referring to a mem_cgroup.

> +	unsigned int seqcount;
> +	struct idle_page_stats stats;
> +	unsigned long scans;
> +
> +	do {
> +		seqcount = read_seqcount_begin(&mem->idle_page_stats_lock);
> +		stats = mem->idle_page_stats;
> +		scans = mem->idle_page_scans;
> +	} while (read_seqcount_retry(&mem->idle_page_stats_lock, seqcount));
> +
> +	cb->fill(cb, "idle_clean", stats.idle_clean * PAGE_SIZE);
> +	cb->fill(cb, "idle_dirty_file", stats.idle_dirty_file * PAGE_SIZE);
> +	cb->fill(cb, "idle_dirty_swap", stats.idle_dirty_swap * PAGE_SIZE);

So the user interface has units of bytes.  Was that documented
somewhere?  Is it worth bothering with?  getpagesize() exists...

(Actually, do we have a documentation update for the entire feature?)

> +	cb->fill(cb, "scans", scans);
> +
> +	return 0;
> +}
> +#endif /* CONFIG_KSTALED */
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
>
> ...
>
> @@ -5568,3 +5613,249 @@ static int __init enable_swap_account(char *s)
>  __setup("swapaccount=", enable_swap_account);
>  
>  #endif
> +
> +#ifdef CONFIG_KSTALED
> +
> +static unsigned int kstaled_scan_seconds;
> +static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
> +
> +static inline void kstaled_scan_page(struct page *page)

uninline this.  You may find that the compiler already uninlined it. 
Or it might inline it for you even if it wasn't declared inline.  gcc
does a decent job of optimizing this stuff for us and hints are often
unneeded.

> +{
> +	bool is_locked = false;
> +	bool is_file;
> +	struct pr_info info;
> +	struct page_cgroup *pc;
> +	struct idle_page_stats *stats;
> +
> +	/*
> +	 * Before taking the page reference, check if the page is
> +	 * a user page which is not obviously unreclaimable
> +	 * (we will do more complete checks later).
> +	 */
> +	if (!PageLRU(page) || PageMlocked(page) ||
> +	    (page->mapping == NULL && !PageSwapCache(page)))
> +		return;
> +
> +	if (!get_page_unless_zero(page))
> +		return;
> +
> +	/* Recheck now that we have the page reference. */
> +	if (unlikely(!PageLRU(page) || PageMlocked(page)))
> +		goto out;
> +
> +	/*
> +	 * Anon and SwapCache pages can be identified without locking.
> +	 * For all other cases, we need the page locked in order to
> +	 * dereference page->mapping.
> +	 */
> +	if (PageAnon(page) || PageSwapCache(page))
> +		is_file = false;
> +	else if (!trylock_page(page)) {
> +		/*
> +		 * We need to lock the page to dereference the mapping.
> +		 * But don't risk sleeping by calling lock_page().
> +		 * We don't want to stall kstaled, so we conservatively
> +		 * count locked pages as unreclaimable.
> +		 */

hm.  Pages are rarely locked for very long.  They aren't locked during
writeback.   I question the need for this?

> +		goto out;
> +	} else {
> +		struct address_space *mapping = page->mapping;
> +
> +		is_locked = true;
> +
> +		/*
> +		 * The page is still anon - it has been continuously referenced
> +		 * since the prior check.
> +		 */
> +		VM_BUG_ON(PageAnon(page) || mapping != page_rmapping(page));

Really?  Are you sure that an elevated refcount is sufficient to
stabilise both of these?

> +		/*
> +		 * Check the mapping under protection of the page lock.
> +		 * 1. If the page is not swap cache and has no mapping,
> +		 *    shrink_page_list can't do anything with it.
> +		 * 2. If the mapping is unevictable (as in SHM_LOCK segments),
> +		 *    shrink_page_list can't do anything with it.
> +		 * 3. If the page is swap cache or the mapping is swap backed
> +		 *    (as in shmem), consider it a swappable page.
> +		 * 4. If the backing dev has indicated that it does not want
> +		 *    its pages sync'd to disk (as in ramfs), take this as
> +		 *    a hint that its pages are not reclaimable.
> +		 * 5. Otherwise, consider this as a file page reclaimable
> +		 *    through standard pageout.
> +		 */
> +		if (!mapping && !PageSwapCache(page))
> +			goto out;
> +		else if (mapping_unevictable(mapping))
> +			goto out;
> +		else if (PageSwapCache(page) ||
> +			 mapping_cap_swap_backed(mapping))
> +			is_file = false;
> +		else if (!mapping_cap_writeback_dirty(mapping))
> +			goto out;
> +		else
> +			is_file = true;
> +	}
> +
> +	/* Find out if the page is idle. Also test for pending mlock. */
> +	page_referenced_kstaled(page, is_locked, &info);
> +	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED))
> +		goto out;
> +
> +	/* Locate kstaled stats for the page's cgroup. */
> +	pc = lookup_page_cgroup(page);
> +	if (!pc)
> +		goto out;
> +	lock_page_cgroup(pc);
> +	if (!PageCgroupUsed(pc))
> +		goto unlock_page_cgroup_out;
> +	stats = &pc->mem_cgroup->idle_scan_stats;
> +
> +	/* Finally increment the correct statistic for this page. */
> +	if (!(info.pr_flags & PR_DIRTY) &&
> +	    !PageDirty(page) && !PageWriteback(page))
> +		stats->idle_clean++;
> +	else if (is_file)
> +		stats->idle_dirty_file++;
> +	else
> +		stats->idle_dirty_swap++;
> +
> + unlock_page_cgroup_out:
> +	unlock_page_cgroup(pc);
> +
> + out:
> +	if (is_locked)
> +		unlock_page(page);
> +	put_page(page);
> +}
> +
> +static void kstaled_scan_node(pg_data_t *pgdat)
> +{
> +	unsigned long flags;
> +	unsigned long start, end, pfn;
> +
> +	pgdat_resize_lock(pgdat, &flags);
> +
> +	start = pgdat->node_start_pfn;
> +	end = start + pgdat->node_spanned_pages;
> +
> +	for (pfn = start; pfn < end; pfn++) {
> +		if (need_resched()) {
> +			pgdat_resize_unlock(pgdat, &flags);
> +			cond_resched();
> +			pgdat_resize_lock(pgdat, &flags);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +			/* abort if the node got resized */
> +			if (pfn < pgdat->node_start_pfn ||
> +			    end > (pgdat->node_start_pfn +
> +				   pgdat->node_spanned_pages))
> +				goto abort;
> +#endif
> +		}
> +
> +		if (!pfn_valid(pfn))
> +			continue;
> +
> +		kstaled_scan_page(pfn_to_page(pfn));
> +	}
> +
> +abort:
> +	pgdat_resize_unlock(pgdat, &flags);
> +}
> +
> +static int kstaled(void *dummy)
> +{
> +	while (1) {
> +		int scan_seconds;
> +		int nid;
> +		struct mem_cgroup *mem;
> +
> +		wait_event_interruptible(kstaled_wait,
> +				 (scan_seconds = kstaled_scan_seconds) > 0);
> +		/*
> +		 * We use interruptible wait_event so as not to contribute
> +		 * to the machine load average while we're sleeping.
> +		 * However, we don't actually expect to receive a signal
> +		 * since we run as a kernel thread, so the condition we were
> +		 * waiting for should be true once we get here.
> +		 */
> +		BUG_ON(scan_seconds <= 0);
> +
> +		for_each_mem_cgroup_all(mem)
> +			memset(&mem->idle_scan_stats, 0,
> +			       sizeof(mem->idle_scan_stats));
> +
> +		for_each_node_state(nid, N_HIGH_MEMORY)
> +			kstaled_scan_node(NODE_DATA(nid));
> +
> +		for_each_mem_cgroup_all(mem) {
> +			write_seqcount_begin(&mem->idle_page_stats_lock);
> +			mem->idle_page_stats = mem->idle_scan_stats;
> +			mem->idle_page_scans++;
> +			write_seqcount_end(&mem->idle_page_stats_lock);
> +		}
> +
> +		schedule_timeout_interruptible(scan_seconds * HZ);
> +	}
> +
> +	BUG();
> +	return 0;	/* NOT REACHED */
> +}

OK, I'm really confused.

Take a minimal machine with a single node which contains one zone.

AFAICT this code will measure the number of idle pages in that zone and
then will attribute that number into *every* cgroup in the system. 
With no discrimination between them.  So it really provided no useful
information at all.

I was quite surprised to see a physical page scan!  I'd have expected
kstaled to be doing pte tree walks.


> +static ssize_t kstaled_scan_seconds_show(struct kobject *kobj,
> +					 struct kobj_attribute *attr,
> +					 char *buf)
> +{
> +	return sprintf(buf, "%u\n", kstaled_scan_seconds);
> +}
> +
> +static ssize_t kstaled_scan_seconds_store(struct kobject *kobj,
> +					  struct kobj_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long input;
> +
> +	err = strict_strtoul(buf, 10, &input);

Please use the new kstrto*() interfaces when merging up to mainline.

> +	if (err)
> +		return -EINVAL;
> +	kstaled_scan_seconds = input;
> +	wake_up_interruptible(&kstaled_wait);
> +	return count;
> +}
> +
>
> ...
>
> +static int __init kstaled_init(void)
> +{
> +	int error;
> +	struct task_struct *thread;
> +
> +	error = sysfs_create_group(mm_kobj, &kstaled_attr_group);
> +	if (error) {
> +		pr_err("Failed to create kstaled sysfs node\n");
> +		return error;
> +	}
> +
> +	thread = kthread_run(kstaled, NULL, "kstaled");
> +	if (IS_ERR(thread)) {
> +		pr_err("Failed to start kstaled\n");
> +		return PTR_ERR(thread);
> +	}
> +
> +	return 0;
> +}

I wonder if one thread machine-wide will be sufficient.  We might end
up with per-node threads, for example.  Like kswapd.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 6/8] kstaled: rate limit pages scanned per second.
  2011-09-17  3:39   ` Michel Lespinasse
@ 2011-09-22 23:15     ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2011-09-22 23:15 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Dave Hansen,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf, Andrew Morton

On Fri, 16 Sep 2011 20:39:11 -0700
Michel Lespinasse <walken@google.com> wrote:

> Scan some number of pages from each node every second, instead of trying to
> scan the entire memory at once and being idle for the rest of the configured
> interval.

Well...  why?  The amount of work done per scan interval is the same
(actually, it will be slightly increased due to cache evictions).

I think we should see a good explanation of what observed problem this
hackery^Wtweak is trying to solve.  Once that is revealed, we can
compare the proposed solution with one based on thread policy/priority
(for example).

>
> ....
>
> @@ -5788,21 +5800,60 @@ static int kstaled(void *dummy)
>  		 */
>  		BUG_ON(scan_seconds <= 0);
>  
> -		for_each_mem_cgroup_all(mem)
> -			memset(&mem->idle_scan_stats, 0,
> -			       sizeof(mem->idle_scan_stats));
> +		earlier = jiffies;
>  
> +		scan_done = true;
>  		for_each_node_state(nid, N_HIGH_MEMORY)
> -			kstaled_scan_node(NODE_DATA(nid));
> +			scan_done &= kstaled_scan_node(NODE_DATA(nid),
> +						       scan_seconds, reset);
> +
> +		if (scan_done) {
> +			struct mem_cgroup *mem;
> +
> +			for_each_mem_cgroup_all(mem) {
> +				write_seqcount_begin(&mem->idle_page_stats_lock);
> +				mem->idle_page_stats = mem->idle_scan_stats;
> +				mem->idle_page_scans++;
> +				write_seqcount_end(&mem->idle_page_stats_lock);
> +				memset(&mem->idle_scan_stats, 0,
> +				       sizeof(mem->idle_scan_stats));
> +			}
> +		}
>  
> -		for_each_mem_cgroup_all(mem) {
> -			write_seqcount_begin(&mem->idle_page_stats_lock);
> -			mem->idle_page_stats = mem->idle_scan_stats;
> -			mem->idle_page_scans++;
> -			write_seqcount_end(&mem->idle_page_stats_lock);
> +		delta = jiffies - earlier;
> +		if (delta < HZ / 2) {
> +			delayed = 0;
> +			schedule_timeout_interruptible(HZ - delta);
> +		} else {
> +			/*
> +			 * Emergency throttle if we're taking too long.
> +			 * We are supposed to scan an entire slice in 1 second.
> +			 * If we keep taking longer for 10 consecutive times,
> +			 * scale back our scan_seconds.
> +			 *
> +			 * If someone changed kstaled_scan_seconds while we
> +			 * were running, hope they know what they're doing and
> +			 * assume they've eliminated any delays.
> +			 */
> +			bool updated = false;
> +			spin_lock(&kstaled_scan_seconds_lock);
> +			if (scan_seconds != kstaled_scan_seconds)
> +				delayed = 0;
> +			else if (++delayed == 10) {
> +				delayed = 0;
> +				scan_seconds *= 2;
> +				kstaled_scan_seconds = scan_seconds;
> +				updated = true;
> +			}
> +			spin_unlock(&kstaled_scan_seconds_lock);
> +			if (updated)
> +				pr_warning("kstaled taking too long, "
> +					   "scan_seconds now %d\n",
> +					   scan_seconds);
> +			schedule_timeout_interruptible(HZ / 2);

This is all rather unpleasing.

>  
> -		schedule_timeout_interruptible(scan_seconds * HZ);
> +		reset = scan_done;
>  	}
>  


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 7/8] kstaled: add histogram sampling functionality
  2011-09-17  3:39   ` Michel Lespinasse
@ 2011-09-22 23:15     ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2011-09-22 23:15 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Dave Hansen,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf, Andrew Morton

On Fri, 16 Sep 2011 20:39:12 -0700
Michel Lespinasse <walken@google.com> wrote:

> add statistics for pages that have been idle for 1,2,5,15,30,60,120 or
> 240 scan intervals into /dev/cgroup/*/memory.idle_page_stats

Why?  What's the use case for this feature?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/8] idle page tracking / working set estimation
  2011-09-22 23:13   ` Andrew Morton
@ 2011-09-23  1:23     ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-23  1:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Dave Hansen,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf, Andrew Morton

On Thu, Sep 22, 2011 at 4:13 PM, Andrew Morton <akpm@google.com> wrote:
> On Fri, 16 Sep 2011 20:39:05 -0700
> Michel Lespinasse <walken@google.com> wrote:
>
>> Please comment on the following patches (which are against the v3.0 kernel).
>> We are using these to collect memory utilization statistics for each cgroup
>> accross many machines, and optimize job placement accordingly.
>
> Please consider updating /proc/kpageflags with the three new page
> flags.  If "yes": update.  If "no": explain/justify.

The PG_stale flag should probably be exported that way. I'll make sure
to add this, thanks for the suggestion!

I am not sure about PG_young and PG_idle since they indicate young
bits have been cleared in PTE(s) pointing to the page since the last
page_referenced() call. This seems rather internal - we don't export
PTE young bits in kpageflags currently, nor do we export anything that
would depend on when page_referenced() was last called.

> Which prompts the obvious: the whole feature could have been mostly
> implemented in userspace, using kpageflags.  Some additional kernel
> support would presumably be needed, but I'm not sure how much.
>
> If you haven't already done so, please sketch down what that
> infrastructure would look like and have a think about which approach is
> preferable?

kpageflags does not currently do a page_referenced() call to export
PTE young flags. For a userspace approach, we would have to add that.
Also we would want to actually clear the PTE young bits so that the
page doesn't show up as young again on the next kpageflags read - and,
we wouldn't want to affect the normal LRU algorithms while doing this,
so we'd end up introducing the same PG_young and PG_idle flags. The
next issue would be to find out which cgroup an idle page belongs to -
this could be done by adding a new kpagecgroup file, I suppose. Given
the above, we'd have the necessary components for a userspace approach
- but, the only part that we would really be able to remove from the
kernel side is the loop that scans physical pages and tallies the idle
ones into a per-cgroup count.
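
To make that a bit more concrete, here is a minimal user-space sketch
against today's /proc/kpageflags (flag bit numbers as documented in
Documentation/vm/pagemap.txt). Note that it can only test the
PG_referenced flag, not the pte young bits, which is exactly the gap
described above, and it has no per-cgroup attribution without the
hypothetical kpagecgroup file:

#include <stdio.h>
#include <stdint.h>

/* flag bit numbers from Documentation/vm/pagemap.txt */
#define KPF_REFERENCED	2
#define KPF_DIRTY	4
#define KPF_LRU		5

int main(void)
{
	FILE *f = fopen("/proc/kpageflags", "rb");
	uint64_t flags;
	unsigned long long idle = 0, idle_dirty = 0;

	if (!f) {
		perror("/proc/kpageflags");
		return 1;
	}
	/* one 64-bit flag word per pfn */
	while (fread(&flags, sizeof(flags), 1, f) == 1) {
		if (!(flags & (1ULL << KPF_LRU)))
			continue;	/* not an LRU page, skip */
		if (flags & (1ULL << KPF_REFERENCED))
			continue;	/* recently referenced, not idle */
		idle++;
		if (flags & (1ULL << KPF_DIRTY))
			idle_dirty++;
	}
	fclose(f);
	printf("idle: %llu pages (%llu dirty)\n", idle, idle_dirty);
	return 0;
}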

> What bugs me a bit about the proposal is its cgroups-centricity.  The
> question "how much memory is my application really using" comes up
> again and again.  It predates cgroups.  One way to answer that question
> is to force a massive amount of swapout on the entire machine, then let
> the system recover and take a look at your app's RSS two minutes later.
> This is very lame.
>
> It's a legitimate requirement, and the kstaled infrastructure puts a
> lot of things in place to answer it well.  But as far as I can tell it
> doesn't quite get over the line.  Then again, maybe it _does_ get
> there: put the application into a memcg all of its own, just for
> instrumentation purposes and then use kstaled to monitor it?

Yes, this is what I would recommend in this situation - create a
memory cgroup to move the application into, and see what kstaled
reports.
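
(Concretely: create a dedicated memory cgroup, say /dev/cgroup/myapp,
echo the application's pids into its tasks file, write a scan interval
into /sys/kernel/mm/kstaled/scan_seconds, and read that cgroup's
memory.idle_page_stats once a couple of scan intervals have elapsed.
The /dev/cgroup path is just the layout used in the patch description;
adjust for wherever the memory cgroup hierarchy is mounted.)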

> <later> OK, I'm surprised to discover that kstaled is doing a physical
> scan and not a virtual one.  I assume it works, but I don't know why.
> But it makes the above requirement harder, methinks.

The reason for the physical scan is that a virtual scan would have
some limitations:
- it would only report memory that's virtually mapped - we do want
file pages to be classified as idle or not, regardless of how the file
gets accessed
- it may not work well with jobs that involve short-lived processes.

> How does all this code get along with hugepages, btw?

They should get along now that Andrea updated get_page and
get_page_unless_zero to avoid the race with THP tail page splitting.

However, you're reminding me that I forgot to include the patch that
would make the accounting correct when we encounter a THP page (we
want to report the entire page as idle rather than just the first 4K,
and increment pfn appropriately for the page size)...
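
Roughly, I'd expect that missing piece to look something like the
following in the pfn walk (untested sketch using the generic
compound-page helpers; the stats update would also need to credit the
full number of base pages instead of just one):

	if (PageHead(page)) {
		/* head of a compound (e.g. THP) page: account it once
		 * with its full size ... */
		unsigned long nr = 1UL << compound_order(page);
		/* ... and skip its tail pages in the pfn walk */
		pfn += nr - 1;
	}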

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/8] kstaled: minimalistic implementation.
  2011-09-22 23:14     ` Andrew Morton
@ 2011-09-23  8:37       ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-23  8:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Dave Hansen,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf, Andrew Morton

On Thu, Sep 22, 2011 at 4:14 PM, Andrew Morton <akpm@google.com> wrote:
> nit: please prefer to use identifier "memcg" when referring to a mem_cgroup.

OK. Done in my tree, will resend it shortly.

>> +     cb->fill(cb, "idle_clean", stats.idle_clean * PAGE_SIZE);
>> +     cb->fill(cb, "idle_dirty_file", stats.idle_dirty_file * PAGE_SIZE);
>> +     cb->fill(cb, "idle_dirty_swap", stats.idle_dirty_swap * PAGE_SIZE);
>
> So the user interface has units of bytes.  Was that documented
> somewhere?  Is it worth bothering with?  getpagesize() exists...

This is consistent with existing usage in memory.stat for example. I
think bytes is a good default unit, but I could be convinced to add
_in_bytes to all fields if you think that's needed.
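
For consumers who prefer page counts, the conversion is trivial to do
in user space anyway. A throwaway sketch, with the cgroup path just as
an example:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	FILE *f = fopen("/dev/cgroup/myapp/memory.idle_page_stats", "r");
	char name[64];
	unsigned long long val;
	long page_size = sysconf(_SC_PAGESIZE);

	if (!f) {
		perror("memory.idle_page_stats");
		return 1;
	}
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "scans"))	/* plain counter, not bytes */
			printf("%s: %llu\n", name, val);
		else
			printf("%s: %llu pages\n", name, val / page_size);
	}
	fclose(f);
	return 0;
}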

> (Actually, do we have a documentation update for the entire feature?)

Patch 2 in the series augments Documentation/cgroups/memory.txt

>> +static inline void kstaled_scan_page(struct page *page)
>
> uninline this.  You may find that the compiler already uninlined it.
> Or it might inline it for you even if it wasn't declared inline.  gcc
> does a decent job of optimizing this stuff for us and hints are often
> unneeded.

I tend to manually inline functions that have one single call site.
Some time ago the compilers weren't smart about this, but I suppose
they might have improved. I don't care very strongly either way so
I'll just uninline it as suggested.

>> +     else if (!trylock_page(page)) {
>> +             /*
>> +              * We need to lock the page to dereference the mapping.
>> +              * But don't risk sleeping by calling lock_page().
>> +              * We don't want to stall kstaled, so we conservatively
>> +              * count locked pages as unreclaimable.
>> +              */
>
> hm.  Pages are rarely locked for very long.  They aren't locked during
> writeback.   I question the need for this?

Pages are locked during hard page faults; this is IMO sufficient
reason for the above code.

>> +     } else {
>> +             struct address_space *mapping = page->mapping;
>> +
>> +             is_locked = true;
>> +
>> +             /*
>> +              * The page is still anon - it has been continuously referenced
>> +              * since the prior check.
>> +              */
>> +             VM_BUG_ON(PageAnon(page) || mapping != page_rmapping(page));
>
> Really?  Are you sure that an elevated refcount is sufficient to
> stabilise both of these?

The elevated refcount stabilizes PageAnon().

The mapping is stable only after the page has been locked; note that
page->mapping was read after the page was locked. Essentially I'm
asserting that page_rmapping(page) == page->mapping, which is true for
non-anon pages.

>> +static int kstaled(void *dummy)
>> +{
>> +     while (1) {
>> +             int scan_seconds;
>> +             int nid;
>> +             struct mem_cgroup *mem;
>> +
>> +             wait_event_interruptible(kstaled_wait,
>> +                              (scan_seconds = kstaled_scan_seconds) > 0);
>> +             /*
>> +              * We use interruptible wait_event so as not to contribute
>> +              * to the machine load average while we're sleeping.
>> +              * However, we don't actually expect to receive a signal
>> +              * since we run as a kernel thread, so the condition we were
>> +              * waiting for should be true once we get here.
>> +              */
>> +             BUG_ON(scan_seconds <= 0);
>> +
>> +             for_each_mem_cgroup_all(mem)
>> +                     memset(&mem->idle_scan_stats, 0,
>> +                            sizeof(mem->idle_scan_stats));
>> +
>> +             for_each_node_state(nid, N_HIGH_MEMORY)
>> +                     kstaled_scan_node(NODE_DATA(nid));
>> +
>> +             for_each_mem_cgroup_all(mem) {
>> +                     write_seqcount_begin(&mem->idle_page_stats_lock);
>> +                     mem->idle_page_stats = mem->idle_scan_stats;
>> +                     mem->idle_page_scans++;
>> +                     write_seqcount_end(&mem->idle_page_stats_lock);
>> +             }
>> +
>> +             schedule_timeout_interruptible(scan_seconds * HZ);
>> +     }
>> +
>> +     BUG();
>> +     return 0;       /* NOT REACHED */
>> +}
>
> OK, I'm really confused.
>
> Take a minimal machine with a single node which contains one zone.
>
> AFAICT this code will measure the number of idle pages in that zone and
> then will attribute that number into *every* cgroup in the system.
> With no discrimination between them.  So it really provided no useful
> information at all.

What happens is that we maintain two sets of stats per cgroup:
- idle_scan_stats is reset to 0 at the start of the scan, its counters
get incremented as we scan the node and find idle pages.
- idle_page_stats is what we export; at the end of a scan the tally
from the same cgroup's idle_scan_stats gets copied into this.

> I was quite surprised to see a physical page scan!  I'd have expected
> kstaled to be doing pte tree walks.

We haven't gone that way for two reasons:
- we wanted to find hot and cold file pages as well, even for files
that never get mapped into processes.
- executable files that are run periodically should appear as hot,
even if the executable is not running at the time we happen to scan.

>> +static ssize_t kstaled_scan_seconds_store(struct kobject *kobj,
>> +                                       struct kobj_attribute *attr,
>> +                                       const char *buf, size_t count)
>> +{
>> +     int err;
>> +     unsigned long input;
>> +
>> +     err = strict_strtoul(buf, 10, &input);
>
> Please use the new kstrto*() interfaces when merging up to mainline.

Done. I wasn't aware of this interface, thanks!
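
For reference, the store handler then shrinks to something like this
(sketch only, assuming kstaled_scan_seconds stays an unsigned int):

static ssize_t kstaled_scan_seconds_store(struct kobject *kobj,
					  struct kobj_attribute *attr,
					  const char *buf, size_t count)
{
	unsigned int input;
	int err = kstrtouint(buf, 10, &input);

	if (err)
		return err;
	kstaled_scan_seconds = input;
	wake_up_interruptible(&kstaled_wait);
	return count;
}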

> I wonder if one thread machine-wide will be sufficient.  We might end
> up with per-node threads, for example.  Like kswapd.

I can comment on the history there.

In our fakenuma-based implementation we started with per-node scanning
threads. However, it turned out that for very large files, two
scanning threads could end up scanning pages that share the same
mapping so that the mapping's i_mmap_mutex would get contended. And
the same problem would also show up with large anon VMA regions and
page_lock_anon_vma(). So, we ended up needing to ensure one thread
would scan all fakenuma nodes assigned to a given cgroup, in order to
avoid performance problems.

With memcg we can't as easily know which pages to scan for a given
cgroup, so we end up with one single thread scanning the entire
memory. It's been working well enough for the memory sizes and scan
rates we're interested in so far.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 6/8] kstaled: rate limit pages scanned per second.
  2011-09-22 23:15     ` Andrew Morton
@ 2011-09-23 10:18       ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-23 10:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Dave Hansen,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf, Andrew Morton

On Thu, Sep 22, 2011 at 4:15 PM, Andrew Morton <akpm@google.com> wrote:
> On Fri, 16 Sep 2011 20:39:11 -0700
> Michel Lespinasse <walken@google.com> wrote:
>
>> Scan some number of pages from each node every second, instead of trying to
>> scan the entire memory at once and being idle for the rest of the configured
>> interval.
>
> Well...  why?  The amount of work done per scan interval is the same
> (actually, it will be slightly increased due to cache evictions).
>
> I think we should see a good explanation of what observed problem this
> hackery^Wtweak is trying to solve.  Once that is revealed, we can
> compare the proposed solution with one based on thread policy/priority
> (for example).

There are two aspects to this:

- load smoothing - some people might find it nicer to have a small
amount of load spread over the entire scan interval, rather than a
spike when we trigger the scanning followed by idle time for the rest
of the interval. That part is highly debatable and there are probably
better ways to achieve it.

- jitter reduction - if we scanned the entire memory at once without
sleeping, the pages scanned first would see a fairly constant interval
between the times they are looked at; however, since the time needed
to scan all pages is not constant (it varies with CPU load and with
pages being allocated and freed), the pages scanned towards the end of
each pass would see more jitter. Scanning a fixed number of pages per
second reduces this effect (rough sketch below).
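
To make the idea concrete, here is a rough sketch (not the actual
patch 6 code; the resume bookkeeping and the per-page check are
simplified) of splitting one node's scan into per-second chunks:

/*
 * Each second, scan roughly node_spanned_pages / scan_seconds pages,
 * resuming where the previous chunk stopped, so that one full pass
 * over the node takes about one scan interval.
 */
static void kstaled_scan_node_chunk(pg_data_t *pgdat,
				    unsigned int scan_seconds)
{
	/* illustration only; the real code keeps per-node resume state */
	static unsigned long resume_pfn;
	unsigned long end = pgdat->node_start_pfn + pgdat->node_spanned_pages;
	unsigned long budget = DIV_ROUND_UP(pgdat->node_spanned_pages,
					    scan_seconds);
	unsigned long pfn = max(resume_pfn, pgdat->node_start_pfn);

	for (; pfn < end && budget; pfn++) {
		if (!pfn_valid(pfn))
			continue;
		budget--;
		/* ... per-page idle check, as in the unmodified scan ... */
	}
	resume_pfn = (pfn < end) ? pfn : pgdat->node_start_pfn;
}

The kstaled main loop would then call something like this once per
second for each node, instead of scanning everything and sleeping for
scan_seconds.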

> This is all rather unpleasing.

Yeah, this is not my favourite patch in the series :/

Would it help if I reordered it last in the series? It seems more
controversial, and the later patches don't functionally depend on it.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 7/8] kstaled: add histogram sampling functionality
  2011-09-22 23:15     ` Andrew Morton
@ 2011-09-23 10:26       ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-23 10:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Dave Hansen,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf, Andrew Morton

On Thu, Sep 22, 2011 at 4:15 PM, Andrew Morton <akpm@google.com> wrote:
> On Fri, 16 Sep 2011 20:39:12 -0700
> Michel Lespinasse <walken@google.com> wrote:
>
>> add statistics for pages that have been idle for 1,2,5,15,30,60,120 or
>> 240 scan intervals into /dev/cgroup/*/memory.idle_page_stats
>
> Why?  What's the use case for this feature?

In the fakenuma implementation of kstaled, we were able to configure a
different scan rate for each container (which was represented in the
kernel as a set of fakenuma nodes rather than a memory cgroup). This
was used to reclaim memory more aggressively from some containers than
others, by varying the interval after which pages would be considered
idle.

In the memcg implementation, scanning is done globally, so we can't
configure a per-cgroup rate. Instead, we track the number of scan
cycles each page has been observed to be idle for. We could then have
a per-cgroup configurable threshold and report only the pages that
have been idle for longer than that number of scans; however, it
seemed nicer to provide a full histogram since the information is
available anyway (one plausible bucketing is sketched below).
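
For illustration, one plausible way to bucket the per-page idle age
(this is only a sketch; the real patch may report the histogram
differently, e.g. cumulatively):

/* Bucket thresholds from the changelog, in scan intervals. */
static const unsigned int idle_buckets[] = { 1, 2, 5, 15, 30, 60, 120, 240 };

/*
 * Map the number of consecutive scans a page has been idle for to the
 * highest bucket it qualifies for; -1 means not idle for even one
 * full interval.  Sketch only.
 */
static int idle_bucket_index(unsigned int idle_scans)
{
	int i;

	for (i = ARRAY_SIZE(idle_buckets) - 1; i >= 0; i--)
		if (idle_scans >= idle_buckets[i])
			return i;
	return -1;
}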

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/8] kstaled: documentation and config option.
  2011-09-17  3:39   ` Michel Lespinasse
@ 2011-09-23 19:27     ` Rik van Riel
  -1 siblings, 0 replies; 54+ messages in thread
From: Rik van Riel @ 2011-09-23 19:27 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro,
	Hugh Dickins, Peter Zijlstra, Michael Wolf

On 09/16/2011 11:39 PM, Michel Lespinasse wrote:
> Extend memory cgroup documentation to describe the optional idle page
> tracking features, and add the corresponding configuration option.
>
>
> Signed-off-by: Michel Lespinasse<walken@google.com>

> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -370,3 +370,13 @@ config CLEANCACHE
>   	  in a negligible performance hit.
>
>   	  If unsure, say Y to enable cleancache
> +
> +config KSTALED
> +       depends on CGROUP_MEM_RES_CTLR

Looking at patch #3, I wonder if this needs to be dependent
on 64 bit, or at least make sure this is not selected when
a user builds a 32 bit kernel with NUMA.

The reason is that on a 32 bit system we could run out of
page flags + zone bits + node bits.

> +       bool "Per-cgroup idle page tracking"
> +       help
> +         This feature allows the kernel to report the amount of user pages
> +	 in a cgroup that have not been touched in a given time.
> +	 This information may be used to size the cgroups and/or for
> +	 job placement within a compute cluster.
> +	 See Documentation/cgroups/memory.txt for a more complete description.



-- 
All rights reversed

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/8] idle page tracking / working set estimation
  2011-09-17  3:39 ` Michel Lespinasse
@ 2011-09-27 10:03   ` Balbir Singh
  -1 siblings, 0 replies; 54+ messages in thread
From: Balbir Singh @ 2011-09-27 10:03 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Peter Zijlstra, Michael Wolf

On Sat, Sep 17, 2011 at 9:09 AM, Michel Lespinasse <walken@google.com> wrote:
> Please comment on the following patches (which are against the v3.0 kernel).
> We are using these to collect memory utilization statistics for each cgroup
> accross many machines, and optimize job placement accordingly.
>
> The statistics are intended to be compared accross many machines - we
> don't just want to know which cgroup to reclaim from on an individual
> machine, we also need to know which machine is best to target a job onto
> within a large cluster. Also, we try to have a low impact on the normal
> MM algorithms - we think they already do a fine job balancing resources
> on individual machines, so we are not trying to mess up with that here.
>
> Patch 1 introduces no functionality; it modifies the page_referenced API
> so that it can be more easily extended in patch 3.
>
> Patch 2 documents the proposed features, and adds a configuration option
> for these. When the features are compiled in, they are still disabled
> until the administrator sets up the desired scanning interval; however
> the configuration option seems necessary as the features make use of
> 3 extra page flags - there is plenty of space for these in 64-bit builds,
> but less so in 32-bit builds...
>
> Patch 3 introduces page_referenced_kstaled(), which is similar to
> page_referenced() but is used for idle page tracking rather than
> for memory reclaimation. Since both functions clear the pte_young bits
> and we don't want them to interfere with each other, two new page flags
> are introduced that track when young pte references have been cleared by
> each of the page_referenced variants.

Sorry, I have trouble parsing this sentence, could you elaborate on "when"?


Balbir Singh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/8] idle page tracking / working set estimation
  2011-09-27 10:03   ` Balbir Singh
@ 2011-09-27 10:14     ` Michel Lespinasse
  -1 siblings, 0 replies; 54+ messages in thread
From: Michel Lespinasse @ 2011-09-27 10:14 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Peter Zijlstra, Michael Wolf

On Tue, Sep 27, 2011 at 3:03 AM, Balbir Singh <bsingharora@gmail.com> wrote:
> On Sat, Sep 17, 2011 at 9:09 AM, Michel Lespinasse <walken@google.com> wrote:
>> Patch 3 introduces page_referenced_kstaled(), which is similar to
>> page_referenced() but is used for idle page tracking rather than
>> for memory reclaimation. Since both functions clear the pte_young bits
>> and we don't want them to interfere with each other, two new page flags
>> are introduced that track when young pte references have been cleared by
>> each of the page_referenced variants.
>
> Sorry, I have trouble parsing this sentence, could you elaborate on "when"?

page_referenced() indicates if a page was accessed since the previous
page_referenced() call.

page_referenced_kstaled() indicates if a page was accessed since the
previous page_referenced_kstaled() call.

Both functions need to clear PTE young bits; however, we don't want
the two functions to interfere with each other. To achieve this, we
add two page bits indicating that a young PTE has been observed (and
cleared) by one of the functions but not yet consumed by the other.
Roughly, the scheme looks like the sketch below.
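
A conceptual sketch of that bookkeeping, using a plain struct and
invented names instead of the real page flag bits, just to show the
invariant:

/*
 * Illustration only: the two "pending" fields stand in for the two new
 * page flags; all names here are made up for this sketch.
 */
struct ref_note {
	bool pending_for_reclaim;	/* young pte cleared by kstaled,
					   not yet seen by page_referenced() */
	bool pending_for_kstaled;	/* young pte cleared by reclaim,
					   not yet seen by page_referenced_kstaled() */
};

/* Whichever variant clears a young pte bit leaves a note for the other. */
static void note_young_cleared(struct ref_note *n, bool cleared_by_kstaled)
{
	if (cleared_by_kstaled)
		n->pending_for_reclaim = true;
	else
		n->pending_for_kstaled = true;
}

/* kstaled counts a reference if it saw a young pte or a pending note. */
static bool referenced_for_kstaled(struct ref_note *n, bool saw_young_pte)
{
	bool pending = n->pending_for_kstaled;

	n->pending_for_kstaled = false;
	return saw_young_pte || pending;
}

The reclaim-side check is symmetric, consuming pending_for_reclaim.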

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/8] idle page tracking / working set estimation
  2011-09-27 10:14     ` Michel Lespinasse
@ 2011-09-27 16:50       ` Balbir Singh
  -1 siblings, 0 replies; 54+ messages in thread
From: Balbir Singh @ 2011-09-27 16:50 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Peter Zijlstra, Michael Wolf

>>
>> Sorry, I have trouble parsing this sentence, could you elaborate on "when"?
>
> page_referenced() indicates if a page was accessed since the previous
> page_referenced() call.
>
> page_referenced_kstaled() indicates if a page was accessed since the
> previous page_referenced_kstaled() call.
>
> Both of the functions need to clear PTE young bits; however we don't
> want the two functions to interfere with each other. To achieve this,
> we add two page bits to indicate when a young PTE has been observed by
> one of the functions but not by the other.

OK, and this gives different page aging schemes for the same page? Is
this to track state changes like:

PR1 sees: PTE x young as 0
PR2 sees: PTE x as 1, the rest as 0

so that PR1 and PR2 will disagree? Should I be looking deeper into the
patches to understand?

Balbir

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2011-09-27 16:57 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-17  3:39 [PATCH 0/8] idle page tracking / working set estimation Michel Lespinasse
2011-09-17  3:39 ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 1/8] page_referenced: replace vm_flags parameter with struct pr_info Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-17  3:44   ` Joe Perches
2011-09-17  3:44     ` Joe Perches
2011-09-17  4:51     ` Michel Lespinasse
2011-09-17  4:51       ` Michel Lespinasse
2011-09-20 19:05   ` Rik van Riel
2011-09-20 19:05     ` Rik van Riel
2011-09-21  2:51     ` Michel Lespinasse
2011-09-21  2:51       ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 2/8] kstaled: documentation and config option Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-20 21:23   ` Rik van Riel
2011-09-20 21:23     ` Rik van Riel
2011-09-23 19:27   ` Rik van Riel
2011-09-23 19:27     ` Rik van Riel
2011-09-17  3:39 ` [PATCH 3/8] kstaled: page_referenced_kstaled() and supporting infrastructure Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-20 19:36   ` Peter Zijlstra
2011-09-20 19:36     ` Peter Zijlstra
2011-09-17  3:39 ` [PATCH 4/8] kstaled: minimalistic implementation Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-22 23:14   ` Andrew Morton
2011-09-22 23:14     ` Andrew Morton
2011-09-23  8:37     ` Michel Lespinasse
2011-09-23  8:37       ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 5/8] kstaled: skip non-RAM regions Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 6/8] kstaled: rate limit pages scanned per second Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-22 23:15   ` Andrew Morton
2011-09-22 23:15     ` Andrew Morton
2011-09-23 10:18     ` Michel Lespinasse
2011-09-23 10:18       ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 7/8] kstaled: add histogram sampling functionality Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-22 23:15   ` Andrew Morton
2011-09-22 23:15     ` Andrew Morton
2011-09-23 10:26     ` Michel Lespinasse
2011-09-23 10:26       ` Michel Lespinasse
2011-09-17  3:39 ` [PATCH 8/8] kstaled: add incrementally updating stale page count Michel Lespinasse
2011-09-17  3:39   ` Michel Lespinasse
2011-09-22 23:13 ` [PATCH 0/8] idle page tracking / working set estimation Andrew Morton
2011-09-22 23:13   ` Andrew Morton
2011-09-23  1:23   ` Michel Lespinasse
2011-09-23  1:23     ` Michel Lespinasse
2011-09-27 10:03 ` Balbir Singh
2011-09-27 10:03   ` Balbir Singh
2011-09-27 10:14   ` Michel Lespinasse
2011-09-27 10:14     ` Michel Lespinasse
2011-09-27 16:50     ` Balbir Singh
2011-09-27 16:50       ` Balbir Singh

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.