* [PATCH 0/9] V2: idle page tracking / working set estimation
@ 2011-09-28  0:48 ` Michel Lespinasse
  0 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

This is a followup to the prior version of this patchset, which I sent out
on September 16.

I have addressed most of the basic feedback I got so far:

- Renamed struct pr_info -> struct page_referenced_info

- Config option now depends on 64BIT, as we may not have sufficient
  free page flags in 32-bit builds

- Renamed mem -> memcg in kstaled code within memcontrol.c

- Uninlined kstaled_scan_page

- Replaced strict_strtoul -> kstrtoul

- Report PG_stale in /proc/kpageflags

- Fix accounting of THP pages. Sorry for forgetting to do this in the
  V1 patchset. To detail the change: I had to make sure page_referenced()
  reports THP pages as dirty (as they always are - the dirty bit in the
  pmd is currently meaningless), and to update the minimalistic
  implementation to count each THP page as equivalent to 512 small pages.

- The ugliest parts of patch 6 (rate limit pages scanned per second) have
  been reworked. If the scanning thread gets delayed, it tries to catch up
  so as to minimize jitter. If it can't catch up, it would probably be a
  good idea to increase the scanning interval, but this is left up
  to userspace.

Michel Lespinasse (9):
  page_referenced: replace vm_flags parameter with struct page_referenced_info
  kstaled: documentation and config option.
  kstaled: page_referenced_kstaled() and supporting infrastructure.
  kstaled: minimalistic implementation.
  kstaled: skip non-RAM regions.
  kstaled: rate limit pages scanned per second.
  kstaled: add histogram sampling functionality
  kstaled: add incrementally updating stale page count
  kstaled: export PG_stale in /proc/kpageflags

 Documentation/cgroups/memory.txt  |  103 ++++++++-
 arch/x86/include/asm/page_types.h |    8 +
 arch/x86/kernel/e820.c            |   45 ++++
 fs/proc/page.c                    |    4 +
 include/linux/kernel-page-flags.h |    2 +
 include/linux/ksm.h               |    9 +-
 include/linux/mmzone.h            |   11 +
 include/linux/page-flags.h        |   50 ++++
 include/linux/pagemap.h           |   11 +-
 include/linux/rmap.h              |   82 ++++++-
 mm/Kconfig                        |   10 +
 mm/internal.h                     |    1 +
 mm/ksm.c                          |   15 +-
 mm/memcontrol.c                   |  479 +++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c               |    6 +
 mm/mlock.c                        |    1 +
 mm/rmap.c                         |  138 ++++++-----
 mm/swap.c                         |    1 +
 mm/vmscan.c                       |   20 +-
 19 files changed, 899 insertions(+), 97 deletions(-)

-- 
1.7.3.1



* [PATCH 1/9] page_referenced: replace vm_flags parameter with struct page_referenced_info
  2011-09-28  0:48 ` Michel Lespinasse
@ 2011-09-28  0:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

Introduce struct page_referenced_info, passed into the page_referenced()
family of functions, to represent information about the pte references
that have been found for that page. It currently contains the vm_flags
information as well as a PR_REFERENCED flag. The idea is to make it easy
to extend the API with new flags.
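
To illustrate the new calling convention, here is a minimal sketch of a
converted caller (the function name is made up for the example; the real
conversions are in the mm/ksm.c, mm/rmap.c and mm/vmscan.c hunks below):

#include <linux/mm.h>
#include <linux/rmap.h>

/* Sketch only: returns true if any pte reference was found. */
static bool example_page_was_referenced(struct page *page,
					struct mem_cgroup *memcg)
{
	struct page_referenced_info info;

	/* page_referenced() no longer returns a count; it fills *info. */
	page_referenced(page, 1 /* is_locked */, memcg, &info);

	if (info.vm_flags & VM_LOCKED)
		return false;	/* mlocked vma; let try_to_unmap() deal with it */

	return info.pr_flags & PR_REFERENCED;
}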


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/ksm.h  |    9 ++---
 include/linux/rmap.h |   28 ++++++++++-----
 mm/ksm.c             |   15 +++-----
 mm/rmap.c            |   92 +++++++++++++++++++++++---------------------------
 mm/vmscan.c          |   18 +++++----
 5 files changed, 81 insertions(+), 81 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 3319a69..fac4b16 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -83,8 +83,8 @@ static inline int ksm_might_need_to_copy(struct page *page,
 		 page->index != linear_page_index(vma, address));
 }
 
-int page_referenced_ksm(struct page *page,
-			struct mem_cgroup *memcg, unsigned long *vm_flags);
+void page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
+			 struct page_referenced_info *info);
 int try_to_unmap_ksm(struct page *page, enum ttu_flags flags);
 int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
 		  struct vm_area_struct *, unsigned long, void *), void *arg);
@@ -119,10 +119,9 @@ static inline int ksm_might_need_to_copy(struct page *page,
 	return 0;
 }
 
-static inline int page_referenced_ksm(struct page *page,
-			struct mem_cgroup *memcg, unsigned long *vm_flags)
+static inline void page_referenced_ksm(struct page *page,
+		struct mem_cgroup *memcg, struct page_referenced_info *info)
 {
-	return 0;
 }
 
 static inline int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..82fef42 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -67,6 +67,15 @@ struct anon_vma_chain {
 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
 };
 
+/*
+ * Information to be filled by page_referenced() and friends.
+ */
+struct page_referenced_info {
+	unsigned long vm_flags;
+	unsigned int pr_flags;
+#define PR_REFERENCED  1
+};
+
 #ifdef CONFIG_MMU
 static inline void get_anon_vma(struct anon_vma *anon_vma)
 {
@@ -156,10 +165,11 @@ static inline void page_dup_rmap(struct page *page)
 /*
  * Called from mm/vmscan.c to handle paging out
  */
-int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *cnt, unsigned long *vm_flags);
-int page_referenced_one(struct page *, struct vm_area_struct *,
-	unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
+void page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt,
+		     struct page_referenced_info *info);
+void page_referenced_one(struct page *, struct vm_area_struct *,
+			 unsigned long address, unsigned int *mapcount,
+			 struct page_referenced_info *info);
 
 enum ttu_flags {
 	TTU_UNMAP = 0,			/* unmap mode */
@@ -234,12 +244,12 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
 #define anon_vma_prepare(vma)	(0)
 #define anon_vma_link(vma)	do {} while (0)
 
-static inline int page_referenced(struct page *page, int is_locked,
-				  struct mem_cgroup *cnt,
-				  unsigned long *vm_flags)
+static inline void page_referenced(struct page *page, int is_locked,
+				   struct mem_cgroup *cnt,
+				   struct page_referenced_info *info)
 {
-	*vm_flags = 0;
-	return 0;
+	info->vm_flags = 0;
+	info->pr_flags = 0;
 }
 
 #define try_to_unmap(page, refs) SWAP_FAIL
diff --git a/mm/ksm.c b/mm/ksm.c
index 9a68b0c..fc3fb06 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1587,14 +1587,13 @@ struct page *ksm_does_need_to_copy(struct page *page,
 	return new_page;
 }
 
-int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
-			unsigned long *vm_flags)
+void page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
+			struct page_referenced_info *info)
 {
 	struct stable_node *stable_node;
 	struct rmap_item *rmap_item;
 	struct hlist_node *hlist;
 	unsigned int mapcount = page_mapcount(page);
-	int referenced = 0;
 	int search_new_forks = 0;
 
 	VM_BUG_ON(!PageKsm(page));
@@ -1602,7 +1601,7 @@ int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
 
 	stable_node = page_stable_node(page);
 	if (!stable_node)
-		return 0;
+		return;
 again:
 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
@@ -1627,19 +1626,17 @@ again:
 			if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
 				continue;
 
-			referenced += page_referenced_one(page, vma,
-				rmap_item->address, &mapcount, vm_flags);
+			page_referenced_one(page, vma, rmap_item->address,
+					    &mapcount, info);
 			if (!search_new_forks || !mapcount)
 				break;
 		}
 		anon_vma_unlock(anon_vma);
 		if (!mapcount)
-			goto out;
+			return;
 	}
 	if (!search_new_forks++)
 		goto again;
-out:
-	return referenced;
 }
 
 int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
diff --git a/mm/rmap.c b/mm/rmap.c
index 23295f6..f87afd0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -648,12 +648,12 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
  * Subfunctions of page_referenced: page_referenced_one called
  * repeatedly from either page_referenced_anon or page_referenced_file.
  */
-int page_referenced_one(struct page *page, struct vm_area_struct *vma,
-			unsigned long address, unsigned int *mapcount,
-			unsigned long *vm_flags)
+void page_referenced_one(struct page *page, struct vm_area_struct *vma,
+			 unsigned long address, unsigned int *mapcount,
+			 struct page_referenced_info *info)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int referenced = 0;
+	bool referenced = false;
 
 	if (unlikely(PageTransHuge(page))) {
 		pmd_t *pmd;
@@ -667,19 +667,19 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 					     PAGE_CHECK_ADDRESS_PMD_FLAG);
 		if (!pmd) {
 			spin_unlock(&mm->page_table_lock);
-			goto out;
+			return;
 		}
 
 		if (vma->vm_flags & VM_LOCKED) {
 			spin_unlock(&mm->page_table_lock);
 			*mapcount = 0;	/* break early from loop */
-			*vm_flags |= VM_LOCKED;
-			goto out;
+			info->vm_flags |= VM_LOCKED;
+			return;
 		}
 
 		/* go ahead even if the pmd is pmd_trans_splitting() */
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
-			referenced++;
+			referenced = true;
 		spin_unlock(&mm->page_table_lock);
 	} else {
 		pte_t *pte;
@@ -691,13 +691,13 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		 */
 		pte = page_check_address(page, mm, address, &ptl, 0);
 		if (!pte)
-			goto out;
+			return;
 
 		if (vma->vm_flags & VM_LOCKED) {
 			pte_unmap_unlock(pte, ptl);
 			*mapcount = 0;	/* break early from loop */
-			*vm_flags |= VM_LOCKED;
-			goto out;
+			info->vm_flags |= VM_LOCKED;
+			return;
 		}
 
 		if (ptep_clear_flush_young_notify(vma, address, pte)) {
@@ -709,7 +709,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			 * set PG_referenced or activated the page.
 			 */
 			if (likely(!VM_SequentialReadHint(vma)))
-				referenced++;
+				referenced = true;
 		}
 		pte_unmap_unlock(pte, ptl);
 	}
@@ -718,28 +718,27 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	   swap token and is in the middle of a page fault. */
 	if (mm != current->mm && has_swap_token(mm) &&
 			rwsem_is_locked(&mm->mmap_sem))
-		referenced++;
+		referenced = true;
 
 	(*mapcount)--;
 
-	if (referenced)
-		*vm_flags |= vma->vm_flags;
-out:
-	return referenced;
+	if (referenced) {
+		info->vm_flags |= vma->vm_flags;
+		info->pr_flags |= PR_REFERENCED;
+	}
 }
 
-static int page_referenced_anon(struct page *page,
-				struct mem_cgroup *mem_cont,
-				unsigned long *vm_flags)
+static void page_referenced_anon(struct page *page,
+				 struct mem_cgroup *mem_cont,
+				 struct page_referenced_info *info)
 {
 	unsigned int mapcount;
 	struct anon_vma *anon_vma;
 	struct anon_vma_chain *avc;
-	int referenced = 0;
 
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
-		return referenced;
+		return;
 
 	mapcount = page_mapcount(page);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
@@ -754,21 +753,20 @@ static int page_referenced_anon(struct page *page,
 		 */
 		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
 			continue;
-		referenced += page_referenced_one(page, vma, address,
-						  &mapcount, vm_flags);
+		page_referenced_one(page, vma, address, &mapcount, info);
 		if (!mapcount)
 			break;
 	}
 
 	page_unlock_anon_vma(anon_vma);
-	return referenced;
 }
 
 /**
  * page_referenced_file - referenced check for object-based rmap
  * @page: the page we're checking references on.
  * @mem_cont: target memory controller
- * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @info: collect encountered vma->vm_flags who actually referenced the page
+ *        as well as flags describing the page references encountered.
  *
  * For an object-based mapped page, find all the places it is mapped and
  * check/clear the referenced flag.  This is done by following the page->mapping
@@ -777,16 +775,15 @@ static int page_referenced_anon(struct page *page,
  *
  * This function is only called from page_referenced for object-based pages.
  */
-static int page_referenced_file(struct page *page,
-				struct mem_cgroup *mem_cont,
-				unsigned long *vm_flags)
+static void page_referenced_file(struct page *page,
+				 struct mem_cgroup *mem_cont,
+				 struct page_referenced_info *info)
 {
 	unsigned int mapcount;
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
-	int referenced = 0;
 
 	/*
 	 * The caller's checks on page->mapping and !PageAnon have made
@@ -822,14 +819,12 @@ static int page_referenced_file(struct page *page,
 		 */
 		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
 			continue;
-		referenced += page_referenced_one(page, vma, address,
-						  &mapcount, vm_flags);
+		page_referenced_one(page, vma, address, &mapcount, info);
 		if (!mapcount)
 			break;
 	}
 
 	mutex_unlock(&mapping->i_mmap_mutex);
-	return referenced;
 }
 
 /**
@@ -837,45 +832,42 @@ static int page_referenced_file(struct page *page,
  * @page: the page to test
  * @is_locked: caller holds lock on the page
  * @mem_cont: target memory controller
- * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @info: collect encountered vma->vm_flags who actually referenced the page
+ *        as well as flags describing the page references encountered.
  *
  * Quick test_and_clear_referenced for all mappings to a page,
  * returns the number of ptes which referenced the page.
  */
-int page_referenced(struct page *page,
-		    int is_locked,
-		    struct mem_cgroup *mem_cont,
-		    unsigned long *vm_flags)
+void page_referenced(struct page *page,
+		     int is_locked,
+		     struct mem_cgroup *mem_cont,
+		     struct page_referenced_info *info)
 {
-	int referenced = 0;
 	int we_locked = 0;
 
-	*vm_flags = 0;
+	info->vm_flags = 0;
+	info->pr_flags = 0;
+
 	if (page_mapped(page) && page_rmapping(page)) {
 		if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
 			we_locked = trylock_page(page);
 			if (!we_locked) {
-				referenced++;
+				info->pr_flags |= PR_REFERENCED;
 				goto out;
 			}
 		}
 		if (unlikely(PageKsm(page)))
-			referenced += page_referenced_ksm(page, mem_cont,
-								vm_flags);
+			page_referenced_ksm(page, mem_cont, info);
 		else if (PageAnon(page))
-			referenced += page_referenced_anon(page, mem_cont,
-								vm_flags);
+			page_referenced_anon(page, mem_cont, info);
 		else if (page->mapping)
-			referenced += page_referenced_file(page, mem_cont,
-								vm_flags);
+			page_referenced_file(page, mem_cont, info);
 		if (we_locked)
 			unlock_page(page);
 	}
 out:
 	if (page_test_and_clear_young(page_to_pfn(page)))
-		referenced++;
-
-	return referenced;
+		info->pr_flags |= PR_REFERENCED;
 }
 
 static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d036e59..f0a8a1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -647,10 +647,10 @@ enum page_references {
 static enum page_references page_check_references(struct page *page,
 						  struct scan_control *sc)
 {
-	int referenced_ptes, referenced_page;
-	unsigned long vm_flags;
+	int referenced_page;
+	struct page_referenced_info info;
 
-	referenced_ptes = page_referenced(page, 1, sc->mem_cgroup, &vm_flags);
+	page_referenced(page, 1, sc->mem_cgroup, &info);
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
@@ -661,10 +661,10 @@ static enum page_references page_check_references(struct page *page,
 	 * Mlock lost the isolation race with us.  Let try_to_unmap()
 	 * move the page to the unevictable list.
 	 */
-	if (vm_flags & VM_LOCKED)
+	if (info.vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
-	if (referenced_ptes) {
+	if (info.pr_flags & PR_REFERENCED) {
 		if (PageAnon(page))
 			return PAGEREF_ACTIVATE;
 		/*
@@ -1535,7 +1535,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 {
 	unsigned long nr_taken;
 	unsigned long pgscanned;
-	unsigned long vm_flags;
+	struct page_referenced_info info;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
@@ -1582,7 +1582,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			continue;
 		}
 
-		if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
+		page_referenced(page, 0, sc->mem_cgroup, &info);
+		if (info.pr_flags & PR_REFERENCED) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
@@ -1593,7 +1594,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			 * IO, plus JVM can create lots of anon VM_EXEC pages,
 			 * so we ignore them here.
 			 */
-			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
+			if ((info.vm_flags & VM_EXEC) &&
+			    page_is_file_cache(page)) {
 				list_add(&page->lru, &l_active);
 				continue;
 			}
-- 
1.7.3.1



* [PATCH 2/9] kstaled: documentation and config option.
  2011-09-28  0:48 ` Michel Lespinasse
@ 2011-09-28  0:49   ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

Extend the memory cgroup documentation to describe the optional idle page
tracking feature, and add the corresponding configuration option.


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 Documentation/cgroups/memory.txt |  103 +++++++++++++++++++++++++++++++++++++-
 mm/Kconfig                       |   10 ++++
 2 files changed, 112 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 06eb6d9..7ee2eb3 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -672,7 +672,108 @@ At reading, current status of OOM is shown.
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
 
-11. TODO
+
+11. Idle page tracking
+
+Idle page tracking works by scanning physical memory at a known rate,
+finding idle pages, and accounting for them in the cgroup owning them.
+
+Idle pages are defined as user pages (either anon or file backed) that have
+not been accessed for a number of consecutive scans, and are also not
+currently pinned down (for example by being mlocked).
+
+11.1 Usage
+
+The first step is to select the global scanning rate:
+
+# echo 120 > /sys/kernel/mm/kstaled/scan_seconds	# 2 minutes per scan
+
+(At boot time, the default value for /sys/kernel/mm/kstaled/scan_seconds
+is 0 which means the idle page tracking feature is disabled).
+
+Then, the per-cgroup memory.idle_page_stats files get updated at the
+end of every scan. The relevant fields are:
+* idle_clean: idle pages that have been untouched for at least one scan cycle,
+  and are also clean. Being clean and unpinned, such pages are immediately
+  reclaimable by the MM's LRU algorithms.
+* idle_dirty_file: idle pages that have been untouched for at least one
+  scan cycle, are dirty, and are file backed. Such pages are not immediately
+  reclaimable as writeback needs to occur first.
+* idle_dirty_swap: idle pages that have been untouched for at least one
+  scan cycle, are dirty, and would have to be written to swap before being
+  reclaimed. This includes dirty anon memory, tmpfs files and shm segments.
+  Note that such pages are counted as idle_dirty_swap regardless of whether
+  swap is enabled or not on the system.
+* idle_2_clean, idle_2_dirty_file, idle_2_dirty_swap: same definitions as
+  above, but for pages that have been untouched for at least two scan cycles.
+* these fields repeat up to idle_240_clean, idle_240_dirty_file and
+  idle_240_dirty_swap, allowing one to observe idle pages over a variety
+  of idle interval lengths. Note that the accounting is cumulative:
+  pages counted as idle for a given interval length are also counted
+  as idle for smaller interval lengths.
+* scans: number of physical memory scans since the cgroup was created.
+
+All the above fields are updated exactly once per scan.
+
+11.2 Responsiveness guarantees
+
+After a user page stops being touched and/or pinned, it takes at least one
+scan cycle for that page to be considered as idle and accounted as such
+in one of the idle_clean / idle_dirty_file / idle_dirty_swap counts
+(or, n scan cycles for the page to be accounted as idle in one of the
+idle_N_clean / idle_N_dirty_file / idle_N_dirty_swap counts).
+
+However, there is no guarantee that pages will be detected that fast.
+In the worst case, it could take up to two extra scan cycle intervals
+for a page to be accounted as idle. This is because after userspace stops
+touching the page, it may take up to one scan interval before we next
+scan it (at which point the page will be seen as not idle yet since it
+was touched during the previous scan) and after the page is finally scanned
+again and detected as idle, it may take up to one extra scan interval before
+completing the physical memory scan and exporting the updated statistics.
+
+Conversely, when userspace touches or pins a page that was previously
+accounted for as idle, it may take up to two scan intervals before the
+corresponding statistics are updated. Once again, this is because it may
+take up to one scan interval before scanning the page and finding it not
+idle anymore, and up to one extra scan interval before completing the
+physical memory scan and exporting the updated statistics.
+
+11.3 Incremental idle page tracking
+
+In some situations, it is desired to obtain faster feedback when
+previously idle, clean user pages start being touched. Remember that
+unpinned clean pages are immediately reclaimable by the MM's LRU
+algorithms. A high number of such pages being idle in a given cgroup
+indicates that this cgroup is not experiencing high memory pressure.
+A decrease of that number can be seen as a leading indicator that
+memory pressure is about to increase, and it may be desired to act
+upon that indication before the two scan interval measurement delay.
+
+The incremental idle page tracking feature can be used for that case.
+It allows for tracking of idle clean pages only, and only for a
+predetermined number of scan intervals (no histogram functionality as
+in the main interface).
+
+The desired idle period must first be selected on a per-cgroup basis
+by writing an integer to the memory.stale_page_age file. The integer
+is the interval we want pages to be idle for, expressed in scan cycles.
+For example to check for pages that have been idle for 5 consecutive
+scan cycles (equivalent to the idle_5_clean statistic), one would
+write 5 to the memory.stale_page_age file. The default value for the
+memory.stale_page_age file is 0, which disables the incremental idle
+page tracking feature.
+
+During scanning, clean unpinned pages that have not been touched for the
+chosen number of scan cycles are incrementally accounted for and reflected
+in the "stale" statistic in memory.idle_page_stats. Likewise, pages that
+were previously accounted as stale and are found not to be idle anymore
+are also incrementally accounted for. Additionally, any pages that are
+being considered by the LRU replacement algorithm and found to have been
+touched are also incrementally accounted for.
+
+
+12. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/mm/Kconfig b/mm/Kconfig
index 8ca47a5..f6443a0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -370,3 +370,13 @@ config CLEANCACHE
 	  in a negligible performance hit.
 
 	  If unsure, say Y to enable cleancache
+
+config KSTALED
+       depends on CGROUP_MEM_RES_CTLR && 64BIT
+       bool "Per-cgroup idle page tracking"
+       help
+         This feature allows the kernel to report the amount of user pages
+	 in a cgroup that have not been touched in a given time.
+	 This information may be used to size the cgroups and/or for
+	 job placement within a compute cluster.
+	 See Documentation/cgroups/memory.txt for a more complete description.
-- 
1.7.3.1
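
To illustrate how the interface documented above might be consumed, here is
a small userspace sketch that polls a cgroup's memory.idle_page_stats file.
The cgroup path and the "<name> <value>" per-line layout are assumptions
made for the example, not something this patch guarantees:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Path is an assumption; adjust to wherever memcg is mounted. */
	FILE *f = fopen("/dev/cgroup/memory/mygroup/memory.idle_page_stats", "r");
	char name[64];
	unsigned long long value;

	if (!f)
		return 1;

	/*
	 * idle_clean pages are clean, unpinned and immediately reclaimable;
	 * a shrinking idle_clean count hints at rising memory pressure.
	 */
	while (fscanf(f, "%63s %llu", name, &value) == 2) {
		if (!strcmp(name, "idle_clean"))
			printf("idle_clean: %llu pages\n", value);
	}

	fclose(f);
	return 0;
}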



* [PATCH 3/9] kstaled: page_referenced_kstaled() and supporting infrastructure.
  2011-09-28  0:48 ` Michel Lespinasse
@ 2011-09-28  0:49   ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

Add a new page_referenced_kstaled() interface. The desired behavior
is that page_referenced() reports page references since the last
page_referenced() call, and page_referenced_kstaled() reports page
references since the last page_referenced_kstaled() call, with the
two interfaces tracking references independently of each other.

The following events are counted as kstaled page references:
- CPU data access to the page (as noticed through pte_young());
- mark_page_accessed() calls;
- page being freed / reallocated.
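
As an illustration only (the real scanner is added by the "minimalistic
implementation" patch later in this series; the function below is a made-up
sketch), the intended usage from the kstaled side looks roughly like this:

#include <linux/mm.h>
#include <linux/rmap.h>

static void example_kstaled_scan_page(struct page *page)
{
	struct page_referenced_info info;

	/*
	 * Clears pte young bits, but records them in PG_young so that the
	 * next regular page_referenced() call still sees the page as
	 * referenced and LRU aging is not disturbed.
	 */
	page_referenced_kstaled(page, false, &info);

	if (!(info.pr_flags & PR_REFERENCED)) {
		/*
		 * PG_idle survived a full scan interval and no young pte
		 * was found: account the page as idle, using PR_DIRTY to
		 * pick the clean vs dirty bucket.
		 */
	}
}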


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/page-flags.h |   35 ++++++++++++++++++++++
 include/linux/rmap.h       |   68 +++++++++++++++++++++++++++++++++++++++----
 mm/rmap.c                  |   62 ++++++++++++++++++++++++++++-----------
 mm/swap.c                  |    1 +
 4 files changed, 141 insertions(+), 25 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6081493..e964d98 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -51,6 +51,13 @@
  * PG_hwpoison indicates that a page got corrupted in hardware and contains
  * data with incorrect ECC bits that triggered a machine check. Accessing is
  * not safe since it may cause another machine check. Don't touch!
+ *
+ * PG_young indicates that kstaled cleared the young bit on some PTEs pointing
+ * to that page. In order to avoid interacting with the LRU algorithm, we want
+ * the next page_referenced() call to still consider the page young.
+ *
+ * PG_idle indicates that the page has not been referenced since the last time
+ * kstaled scanned it.
  */
 
 /*
@@ -107,6 +114,10 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_KSTALED
+	PG_young,		/* kstaled cleared pte_young */
+	PG_idle,		/* idle since start of kstaled interval */
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -278,6 +289,30 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+#ifdef CONFIG_KSTALED
+
+PAGEFLAG(Young, young)
+PAGEFLAG(Idle, idle)
+
+static inline void set_page_young(struct page *page)
+{
+	if (!PageYoung(page))
+		SetPageYoung(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	if (PageIdle(page))
+		ClearPageIdle(page);
+}
+
+#else /* !CONFIG_KSTALED */
+
+static inline void set_page_young(struct page *page) {}
+static inline void clear_page_idle(struct page *page) {}
+
+#endif /* CONFIG_KSTALED */
+
 u64 stable_page_flags(struct page *page);
 
 static inline int PageUptodate(struct page *page)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 82fef42..88a0b85 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -74,6 +74,8 @@ struct page_referenced_info {
 	unsigned long vm_flags;
 	unsigned int pr_flags;
 #define PR_REFERENCED  1
+#define PR_DIRTY       2
+#define PR_FOR_KSTALED 4
 };
 
 #ifdef CONFIG_MMU
@@ -165,8 +167,8 @@ static inline void page_dup_rmap(struct page *page)
 /*
  * Called from mm/vmscan.c to handle paging out
  */
-void page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt,
-		     struct page_referenced_info *info);
+void __page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt,
+		       struct page_referenced_info *info);
 void page_referenced_one(struct page *, struct vm_area_struct *,
 			 unsigned long address, unsigned int *mapcount,
 			 struct page_referenced_info *info);
@@ -244,12 +246,10 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
 #define anon_vma_prepare(vma)	(0)
 #define anon_vma_link(vma)	do {} while (0)
 
-static inline void page_referenced(struct page *page, int is_locked,
-				   struct mem_cgroup *cnt,
-				   struct page_referenced_info *info)
+static inline void __page_referenced(struct page *page, int is_locked,
+				     struct mem_cgroup *cnt,
+				     struct page_referenced_info *info)
 {
-	info->vm_flags = 0;
-	info->pr_flags = 0;
 }
 
 #define try_to_unmap(page, refs) SWAP_FAIL
@@ -262,6 +262,60 @@ static inline int page_mkclean(struct page *page)
 
 #endif	/* CONFIG_MMU */
 
+/**
+ * page_referenced - test if the page was referenced
+ * @page: the page to test
+ * @is_locked: caller holds lock on the page
+ * @mem_cont: target memory controller
+ * @info: collects vma->vm_flags of referencing vmas and PR_* flags for the references found
+ *
+ * Quick test_and_clear_referenced for all mappings to a page;
+ * the results are collected in @info.
+ */
+static inline void page_referenced(struct page *page,
+				   int is_locked,
+				   struct mem_cgroup *mem_cont,
+				   struct page_referenced_info *info)
+{
+	info->vm_flags = 0;
+	info->pr_flags = 0;
+
+#ifdef CONFIG_KSTALED
+	/*
+	 * Always clear PageYoung at the start of a scanning interval. It will
+	 * get set if kstaled clears a young bit in a pte mapping the page,
+	 * so that vmscan will still see the page as referenced.
+	 */
+	if (PageYoung(page)) {
+		ClearPageYoung(page);
+		info->pr_flags |= PR_REFERENCED;
+	}
+#endif
+
+	__page_referenced(page, is_locked, mem_cont, info);
+}
+
+#ifdef CONFIG_KSTALED
+static inline void page_referenced_kstaled(struct page *page, bool is_locked,
+					   struct page_referenced_info *info)
+{
+	info->vm_flags = 0;
+	info->pr_flags = PR_FOR_KSTALED;
+
+	/*
+	 * Always set PageIdle at the start of a scanning interval. It will
+	 * get cleared if a young page reference is encountered; otherwise
+	 * the page will be counted as idle at the next kstaled scan cycle.
+	 */
+	if (!PageIdle(page)) {
+		SetPageIdle(page);
+		info->pr_flags |= PR_REFERENCED;
+	}
+
+	__page_referenced(page, is_locked, NULL, info);
+}
+#endif
+
 /*
  * Return values of try_to_unmap
  */
diff --git a/mm/rmap.c b/mm/rmap.c
index f87afd0..fa8440e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -670,6 +670,8 @@ void page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			return;
 		}
 
+		info->pr_flags |= PR_DIRTY;
+
 		if (vma->vm_flags & VM_LOCKED) {
 			spin_unlock(&mm->page_table_lock);
 			*mapcount = 0;	/* break early from loop */
@@ -678,8 +680,17 @@ void page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		/* go ahead even if the pmd is pmd_trans_splitting() */
-		if (pmdp_clear_flush_young_notify(vma, address, pmd))
-			referenced = true;
+		if (!(info->pr_flags & PR_FOR_KSTALED)) {
+			if (pmdp_clear_flush_young_notify(vma, address, pmd)) {
+				referenced = true;
+				clear_page_idle(page);
+			}
+		} else {
+			if (pmdp_test_and_clear_young(vma, address, pmd)) {
+				referenced = true;
+				set_page_young(page);
+			}
+		}
 		spin_unlock(&mm->page_table_lock);
 	} else {
 		pte_t *pte;
@@ -693,6 +704,9 @@ void page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		if (!pte)
 			return;
 
+		if (pte_dirty(*pte))
+			info->pr_flags |= PR_DIRTY;
+
 		if (vma->vm_flags & VM_LOCKED) {
 			pte_unmap_unlock(pte, ptl);
 			*mapcount = 0;	/* break early from loop */
@@ -700,23 +714,38 @@ void page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			return;
 		}
 
-		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+		if (!(info->pr_flags & PR_FOR_KSTALED)) {
+			if (ptep_clear_flush_young_notify(vma, address, pte)) {
+				/*
+				 * Don't treat a reference through a
+				 * sequentially read mapping as such.
+				 * If the page has been used in another
+				 * mapping, we will catch it; if this other
+				 * mapping is already gone, the unmap path
+				 * will have set PG_referenced or activated
+				 * the page.
+				 */
+				if (likely(!VM_SequentialReadHint(vma)))
+					referenced = true;
+				clear_page_idle(page);
+			}
+		} else {
 			/*
-			 * Don't treat a reference through a sequentially read
-			 * mapping as such.  If the page has been used in
-			 * another mapping, we will catch it; if this other
-			 * mapping is already gone, the unmap path will have
-			 * set PG_referenced or activated the page.
+			 * Within page_referenced_kstaled():
+			 * skip TLB shootdown & VM_SequentialReadHint heuristic
 			 */
-			if (likely(!VM_SequentialReadHint(vma)))
+			if (ptep_test_and_clear_young(vma, address, pte)) {
 				referenced = true;
+				set_page_young(page);
+			}
 		}
 		pte_unmap_unlock(pte, ptl);
 	}
 
 	/* Pretend the page is referenced if the task has the
 	   swap token and is in the middle of a page fault. */
-	if (mm != current->mm && has_swap_token(mm) &&
+	if (!(info->pr_flags & PR_FOR_KSTALED) &&
+			mm != current->mm && has_swap_token(mm) &&
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced = true;
 
@@ -828,7 +857,7 @@ static void page_referenced_file(struct page *page,
 }
 
 /**
- * page_referenced - test if the page was referenced
+ * __page_referenced - test if the page was referenced
  * @page: the page to test
  * @is_locked: caller holds lock on the page
  * @mem_cont: target memory controller
@@ -838,16 +867,13 @@ static void page_referenced_file(struct page *page,
  * Quick test_and_clear_referenced for all mappings to a page,
  * returns the number of ptes which referenced the page.
  */
-void page_referenced(struct page *page,
-		     int is_locked,
-		     struct mem_cgroup *mem_cont,
-		     struct page_referenced_info *info)
+void __page_referenced(struct page *page,
+		       int is_locked,
+		       struct mem_cgroup *mem_cont,
+		       struct page_referenced_info *info)
 {
 	int we_locked = 0;
 
-	info->vm_flags = 0;
-	info->pr_flags = 0;
-
 	if (page_mapped(page) && page_rmapping(page)) {
 		if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
 			we_locked = trylock_page(page);
diff --git a/mm/swap.c b/mm/swap.c
index 3a442f1..d65b69e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -344,6 +344,7 @@ void mark_page_accessed(struct page *page)
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
+	clear_page_idle(page);
 }
 
 EXPORT_SYMBOL(mark_page_accessed);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

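For readers following along, here is a toy single-page, single-pte userspace
model (not kernel code; names are ad hoc, and mark_page_accessed() clearing
PG_idle is omitted) of the PG_young/PG_idle handshake implemented by
page_referenced(), page_referenced_kstaled() and page_referenced_one() above,
showing that the two scanners do not steal each other's references:

  #include <stdbool.h>
  #include <stdio.h>

  static bool pte_young, pg_young, pg_idle;

  static void cpu_touch(void) { pte_young = true; }	/* hardware sets young */

  static bool kstaled_referenced(void)	/* models page_referenced_kstaled() */
  {
  	bool ref = false;

  	if (!pg_idle) {			/* referenced since the last kstaled scan */
  		pg_idle = true;
  		ref = true;
  	}
  	if (pte_young) {		/* test-and-clear young, no TLB flush */
  		pte_young = false;
  		pg_young = true;	/* keep the reference visible to vmscan */
  		ref = true;
  	}
  	return ref;
  }

  static bool vmscan_referenced(void)	/* models page_referenced() */
  {
  	bool ref = false;

  	if (pg_young) {			/* a reference kstaled consumed earlier */
  		pg_young = false;
  		ref = true;
  	}
  	if (pte_young) {		/* clear-and-flush young */
  		pte_young = false;
  		pg_idle = false;	/* keep the reference visible to kstaled */
  		ref = true;
  	}
  	return ref;
  }

  int main(void)
  {
  	cpu_touch();
  	printf("%d\n", kstaled_referenced());	/* 1: page was just used */
  	printf("%d\n", vmscan_referenced());	/* 1: PG_young preserved the reference */
  	printf("%d\n", kstaled_referenced());	/* 0: no new use, now counted idle */
  	return 0;
  }
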
* [PATCH 4/9] kstaled: minimalistic implementation.
  2011-09-28  0:48 ` Michel Lespinasse
@ 2011-09-28  0:49   ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

Introduce minimal kstaled implementation. The scan rate is controlled by
/sys/kernel/mm/kstaled/scan_seconds and per-cgroup statistics are output
into /dev/cgroup/*/memory.idle_page_stats.
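
Not part of the patch, but for illustration, a minimal userspace sketch of how
the two new files could be exercised (the paths follow the description above
and may differ depending on where the memory cgroup hierarchy is mounted;
120 is an arbitrary interval):

  #include <stdio.h>

  int main(void)
  {
  	FILE *f;
  	char key[64];
  	unsigned long long val;

  	/* start scanning, one full pass every 120 seconds */
  	f = fopen("/sys/kernel/mm/kstaled/scan_seconds", "w");
  	if (!f)
  		return 1;
  	fprintf(f, "120\n");
  	fclose(f);

  	/* per-cgroup results; idle_* values are in bytes, "scans" is a count */
  	f = fopen("/dev/cgroup/memory.idle_page_stats", "r");
  	if (!f)
  		return 1;
  	while (fscanf(f, "%63s %llu", key, &val) == 2)
  		printf("%s = %llu\n", key, val);
  	fclose(f);
  	return 0;
  }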


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/memcontrol.c |  297 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 297 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..e55056f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,8 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/kthread.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -283,6 +285,16 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+#ifdef CONFIG_KSTALED
+	seqcount_t idle_page_stats_lock;
+	struct idle_page_stats {
+		unsigned long idle_clean;
+		unsigned long idle_dirty_file;
+		unsigned long idle_dirty_swap;
+	} idle_page_stats, idle_scan_stats;
+	unsigned long idle_page_scans;
+#endif
 };
 
 /* Stuffs for move charges at task migration. */
@@ -4668,6 +4680,30 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
 }
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_KSTALED
+static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
+	struct cftype *cft,  struct cgroup_map_cb *cb)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	unsigned int seqcount;
+	struct idle_page_stats stats;
+	unsigned long scans;
+
+	do {
+		seqcount = read_seqcount_begin(&memcg->idle_page_stats_lock);
+		stats = memcg->idle_page_stats;
+		scans = memcg->idle_page_scans;
+	} while (read_seqcount_retry(&memcg->idle_page_stats_lock, seqcount));
+
+	cb->fill(cb, "idle_clean", stats.idle_clean * PAGE_SIZE);
+	cb->fill(cb, "idle_dirty_file", stats.idle_dirty_file * PAGE_SIZE);
+	cb->fill(cb, "idle_dirty_swap", stats.idle_dirty_swap * PAGE_SIZE);
+	cb->fill(cb, "scans", scans);
+
+	return 0;
+}
+#endif /* CONFIG_KSTALED */
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4738,6 +4774,12 @@ static struct cftype mem_cgroup_files[] = {
 		.mode = S_IRUGO,
 	},
 #endif
+#ifdef CONFIG_KSTALED
+	{
+		.name = "idle_page_stats",
+		.read_map = mem_cgroup_idle_page_stats_read,
+	},
+#endif
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -5001,6 +5043,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
+#ifdef CONFIG_KSTALED
+	seqcount_init(&mem->idle_page_stats_lock);
+#endif
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
@@ -5568,3 +5613,255 @@ static int __init enable_swap_account(char *s)
 __setup("swapaccount=", enable_swap_account);
 
 #endif
+
+#ifdef CONFIG_KSTALED
+
+static unsigned int kstaled_scan_seconds;
+static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
+
+static unsigned kstaled_scan_page(struct page *page)
+{
+	bool is_locked = false;
+	bool is_file;
+	struct page_referenced_info info;
+	struct page_cgroup *pc;
+	struct idle_page_stats *stats;
+	unsigned nr_pages;
+
+	/*
+	 * Before taking the page reference, check if the page is
+	 * a user page which is not obviously unreclaimable
+	 * (we will do more complete checks later).
+	 */
+	if (!PageLRU(page) ||
+	    (!PageCompound(page) &&
+	     (PageMlocked(page) ||
+	      (page->mapping == NULL && !PageSwapCache(page)))))
+		return 1;
+
+	if (!get_page_unless_zero(page))
+		return 1;
+
+	/* Recheck now that we have the page reference. */
+	if (unlikely(!PageLRU(page)))
+		goto out;
+	nr_pages = 1 << compound_trans_order(page);
+	if (PageMlocked(page))
+		goto out;
+
+	/*
+	 * Anon and SwapCache pages can be identified without locking.
+	 * For all other cases, we need the page locked in order to
+	 * dereference page->mapping.
+	 */
+	if (PageAnon(page) || PageSwapCache(page))
+		is_file = false;
+	else if (!trylock_page(page)) {
+		/*
+		 * We need to lock the page to dereference the mapping.
+		 * But don't risk sleeping by calling lock_page().
+		 * We don't want to stall kstaled, so we conservatively
+		 * count locked pages as unreclaimable.
+		 */
+		goto out;
+	} else {
+		struct address_space *mapping = page->mapping;
+
+		is_locked = true;
+
+		/*
+		 * The page is still anon - it has been continuously referenced
+		 * since the prior check.
+		 */
+		VM_BUG_ON(PageAnon(page) || mapping != page_rmapping(page));
+
+		/*
+		 * Check the mapping under protection of the page lock.
+		 * 1. If the page is not swap cache and has no mapping,
+		 *    shrink_page_list can't do anything with it.
+		 * 2. If the mapping is unevictable (as in SHM_LOCK segments),
+		 *    shrink_page_list can't do anything with it.
+		 * 3. If the page is swap cache or the mapping is swap backed
+		 *    (as in shmem), consider it a swappable page.
+		 * 4. If the backing dev has indicated that it does not want
+		 *    its pages sync'd to disk (as in ramfs), take this as
+		 *    a hint that its pages are not reclaimable.
+		 * 5. Otherwise, consider this as a file page reclaimable
+		 *    through standard pageout.
+		 */
+		if (!mapping && !PageSwapCache(page))
+			goto out;
+		else if (mapping_unevictable(mapping))
+			goto out;
+		else if (PageSwapCache(page) ||
+			 mapping_cap_swap_backed(mapping))
+			is_file = false;
+		else if (!mapping_cap_writeback_dirty(mapping))
+			goto out;
+		else
+			is_file = true;
+	}
+
+	/* Find out if the page is idle. Also test for pending mlock. */
+	page_referenced_kstaled(page, is_locked, &info);
+	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED))
+		goto out;
+
+	/* Locate kstaled stats for the page's cgroup. */
+	pc = lookup_page_cgroup(page);
+	if (!pc)
+		goto out;
+	lock_page_cgroup(pc);
+	if (!PageCgroupUsed(pc))
+		goto unlock_page_cgroup_out;
+	stats = &pc->mem_cgroup->idle_scan_stats;
+
+	/* Finally increment the correct statistic for this page. */
+	if (!(info.pr_flags & PR_DIRTY) &&
+	    !PageDirty(page) && !PageWriteback(page))
+		stats->idle_clean += nr_pages;
+	else if (is_file)
+		stats->idle_dirty_file += nr_pages;
+	else
+		stats->idle_dirty_swap += nr_pages;
+
+ unlock_page_cgroup_out:
+	unlock_page_cgroup(pc);
+
+ out:
+	if (is_locked)
+		unlock_page(page);
+	put_page(page);
+
+	return nr_pages;
+}
+
+static void kstaled_scan_node(pg_data_t *pgdat)
+{
+	unsigned long flags;
+	unsigned long pfn, end;
+
+	pgdat_resize_lock(pgdat, &flags);
+
+	pfn = pgdat->node_start_pfn;
+	end = pfn + pgdat->node_spanned_pages;
+
+	while (pfn < end) {
+		if (need_resched()) {
+			pgdat_resize_unlock(pgdat, &flags);
+			cond_resched();
+			pgdat_resize_lock(pgdat, &flags);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+			/* abort if the node got resized */
+			if (pfn < pgdat->node_start_pfn ||
+			    end > (pgdat->node_start_pfn +
+				   pgdat->node_spanned_pages))
+				goto abort;
+#endif
+		}
+
+		pfn += pfn_valid(pfn) ?
+			kstaled_scan_page(pfn_to_page(pfn)) : 1;
+	}
+
+abort:
+	pgdat_resize_unlock(pgdat, &flags);
+}
+
+static int kstaled(void *dummy)
+{
+	while (1) {
+		int scan_seconds;
+		int nid;
+		struct mem_cgroup *memcg;
+
+		wait_event_interruptible(kstaled_wait,
+				 (scan_seconds = kstaled_scan_seconds) > 0);
+		/*
+		 * We use interruptible wait_event so as not to contribute
+		 * to the machine load average while we're sleeping.
+		 * However, we don't actually expect to receive a signal
+		 * since we run as a kernel thread, so the condition we were
+		 * waiting for should be true once we get here.
+		 */
+		BUG_ON(scan_seconds <= 0);
+
+		for_each_mem_cgroup_all(memcg)
+			memset(&memcg->idle_scan_stats, 0,
+			       sizeof(memcg->idle_scan_stats));
+
+		for_each_node_state(nid, N_HIGH_MEMORY)
+			kstaled_scan_node(NODE_DATA(nid));
+
+		for_each_mem_cgroup_all(memcg) {
+			write_seqcount_begin(&memcg->idle_page_stats_lock);
+			memcg->idle_page_stats = memcg->idle_scan_stats;
+			memcg->idle_page_scans++;
+			write_seqcount_end(&memcg->idle_page_stats_lock);
+		}
+
+		schedule_timeout_interruptible(scan_seconds * HZ);
+	}
+
+	BUG();
+	return 0;	/* NOT REACHED */
+}
+
+static ssize_t kstaled_scan_seconds_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sprintf(buf, "%u\n", kstaled_scan_seconds);
+}
+
+static ssize_t kstaled_scan_seconds_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	int err;
+	unsigned long input;
+
+	err = kstrtoul(buf, 10, &input);
+	if (err)
+		return -EINVAL;
+	kstaled_scan_seconds = input;
+	wake_up_interruptible(&kstaled_wait);
+	return count;
+}
+
+static struct kobj_attribute kstaled_scan_seconds_attr = __ATTR(
+	scan_seconds, 0644,
+	kstaled_scan_seconds_show, kstaled_scan_seconds_store);
+
+static struct attribute *kstaled_attrs[] = {
+	&kstaled_scan_seconds_attr.attr,
+	NULL
+};
+static struct attribute_group kstaled_attr_group = {
+	.name = "kstaled",
+	.attrs = kstaled_attrs,
+};
+
+static int __init kstaled_init(void)
+{
+	int error;
+	struct task_struct *thread;
+
+	error = sysfs_create_group(mm_kobj, &kstaled_attr_group);
+	if (error) {
+		pr_err("Failed to create kstaled sysfs node\n");
+		return error;
+	}
+
+	thread = kthread_run(kstaled, NULL, "kstaled");
+	if (IS_ERR(thread)) {
+		pr_err("Failed to start kstaled\n");
+		return PTR_ERR(thread);
+	}
+
+	return 0;
+}
+module_init(kstaled_init);
+
+#endif /* CONFIG_KSTALED */
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 5/9] kstaled: skip non-RAM regions.
  2011-09-28  0:48 ` Michel Lespinasse
@ 2011-09-28  0:49   ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

Add a pfn_skip_hole function that shrinks the passed input range in order to
skip over pfn ranges that are known not to be RAM backed. The x86
implementation achieves this using e820 tables; other architectures
use a generic no-op implementation.
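
To make the intended calling convention concrete, here is a small userspace
toy (not kernel code, with a made-up RAM map standing in for the e820 table):
pfn_skip_hole() shrinks [start, end) to the first RAM-backed stretch, or
collapses it to an empty range when none is left, and the caller simply
loops - the same shape as the kstaled_scan_node() change below.

  #include <stdio.h>

  struct range { unsigned long start, end; };	/* RAM ranges, in pfns */
  static const struct range ram[] = { { 16, 256 }, { 512, 1024 } };

  /* Toy stand-in for the architecture hook. */
  static void pfn_skip_hole(unsigned long *start, unsigned long *end)
  {
  	unsigned int i;

  	for (i = 0; i < sizeof(ram) / sizeof(ram[0]); i++) {
  		if (ram[i].start >= *end)
  			break;			/* past the requested range */
  		if (ram[i].end <= *start)
  			continue;		/* before the requested range */
  		if (ram[i].start > *start)
  			*start = ram[i].start;
  		if (ram[i].end < *end)
  			*end = ram[i].end;
  		return;
  	}
  	*start = *end;				/* no RAM found: empty range */
  }

  int main(void)
  {
  	unsigned long pfn = 0, end = 2048;

  	while (pfn < end) {
  		unsigned long contiguous = end;

  		pfn_skip_hole(&pfn, &contiguous);
  		if (pfn >= contiguous)
  			break;			/* nothing left to scan */
  		printf("scan pfns [%lu, %lu)\n", pfn, contiguous);
  		pfn = contiguous;		/* pretend we scanned them */
  	}
  	return 0;
  }

Expected output: "scan pfns [16, 256)" followed by "scan pfns [512, 1024)".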


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/include/asm/page_types.h |    8 ++++++
 arch/x86/kernel/e820.c            |   45 +++++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h            |    6 +++++
 mm/memcontrol.c                   |   31 +++++++++++++++----------
 4 files changed, 78 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index bce688d..b0676c2 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -57,6 +57,14 @@ extern unsigned long init_memory_mapping(unsigned long start,
 extern void initmem_init(void);
 extern void free_initmem(void);
 
+extern void e820_skip_hole(unsigned long *start_pfn, unsigned long *end_pfn);
+
+#define ARCH_HAVE_PFN_SKIP_HOLE 1
+static inline void pfn_skip_hole(unsigned long *start, unsigned long *end)
+{
+	e820_skip_hole(start, end);
+}
+
 #endif	/* !__ASSEMBLY__ */
 
 #endif	/* _ASM_X86_PAGE_DEFS_H */
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 3e2ef84..0677873 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1123,3 +1123,48 @@ void __init memblock_find_dma_reserve(void)
 	set_dma_reserve(mem_size_pfn - free_size_pfn);
 #endif
 }
+
+/*
+ * The caller wants to skip pfns that are guaranteed to not be valid
+ * memory. Find a stretch of ram between [start_pfn, end_pfn) and
+ * return its pfn range back through start_pfn and end_pfn.
+ */
+
+void e820_skip_hole(unsigned long *start_pfn, unsigned long *end_pfn)
+{
+	unsigned long start = *start_pfn << PAGE_SHIFT;
+	unsigned long end = *end_pfn << PAGE_SHIFT;
+	int i;
+
+	if (start >= end)
+		goto fail;		/* short-circuit e820 checks */
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		unsigned long last, addr;
+
+		addr = round_up(ei->addr, PAGE_SIZE);
+		last = round_down(ei->addr + ei->size, PAGE_SIZE);
+
+		if (addr >= end)
+			goto fail;	/* We're done, not found */
+		if (last <= start)
+			continue;	/* Not at start yet, move on */
+		if (ei->type != E820_RAM)
+			continue;	/* Not RAM, move on */
+
+		/*
+		 * We've found RAM. If start is in this e820 range, return
+		 * it, otherwise return the start of this e820 range.
+		 */
+
+		if (addr > start)
+			*start_pfn = addr >> PAGE_SHIFT;
+		if (last < end)
+			*end_pfn = last >> PAGE_SHIFT;
+		return;
+	}
+fail:
+	*start_pfn = *end_pfn;
+	return;				/* No luck, return failure */
+}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f7c3eb..6657106 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -930,6 +930,12 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
 #define pfn_to_nid(pfn)		(0)
 #endif
 
+#ifndef ARCH_HAVE_PFN_SKIP_HOLE
+static inline void pfn_skip_hole(unsigned long *start, unsigned long *end)
+{
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e55056f..b75d41f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5747,22 +5747,29 @@ static void kstaled_scan_node(pg_data_t *pgdat)
 	end = pfn + pgdat->node_spanned_pages;
 
 	while (pfn < end) {
-		if (need_resched()) {
-			pgdat_resize_unlock(pgdat, &flags);
-			cond_resched();
-			pgdat_resize_lock(pgdat, &flags);
+		unsigned long contiguous = end;
+
+		/* restrict pfn..contiguous to be a RAM backed range */
+		pfn_skip_hole(&pfn, &contiguous);
+
+		while (pfn < contiguous) {
+			if (need_resched()) {
+				pgdat_resize_unlock(pgdat, &flags);
+				cond_resched();
+				pgdat_resize_lock(pgdat, &flags);
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-			/* abort if the node got resized */
-			if (pfn < pgdat->node_start_pfn ||
-			    end > (pgdat->node_start_pfn +
-				   pgdat->node_spanned_pages))
-				goto abort;
+				/* abort if the node got resized */
+				if (pfn < pgdat->node_start_pfn ||
+				    end > (pgdat->node_start_pfn +
+					   pgdat->node_spanned_pages))
+					goto abort;
 #endif
-		}
+			}
 
-		pfn += pfn_valid(pfn) ?
-			kstaled_scan_page(pfn_to_page(pfn)) : 1;
+			pfn += pfn_valid(pfn) ?
+				kstaled_scan_page(pfn_to_page(pfn)) : 1;
+		}
 	}
 
 abort:
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 6/9] kstaled: rate limit pages scanned per second.
  2011-09-28  0:48 ` Michel Lespinasse
@ 2011-09-28  0:49   ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

Scan some number of pages from each node every second, instead of trying to
scan the entire memory at once and being idle for the rest of the configured
interval.

In addition to spreading the CPU usage over the entire scanning interval,
this also reduces the jitter between two consecutive scans of the same page.
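
As a rough illustration of the chunking this introduces (the numbers below are
made up; the real computation is the DIV_ROUND_UP() in kstaled_scan_node()
further down), each one-second wakeup advances a node by roughly
node_spanned_pages / scan_seconds pages:

  /* standalone userspace sketch, not kernel code */
  #include <stdio.h>

  #define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

  int main(void)
  {
  	unsigned long node_spanned_pages = 4UL << 20;	/* e.g. 16GB of 4K pages */
  	int scan_seconds = 120;				/* hypothetical sysfs setting */

  	/* pages to scan per one-second wakeup: ~34953 */
  	printf("%lu\n", DIV_ROUND_UP(node_spanned_pages, scan_seconds));
  	return 0;
  }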


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mmzone.h |    3 ++
 mm/memcontrol.c        |   71 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6657106..272fbed 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -631,6 +631,9 @@ typedef struct pglist_data {
 	unsigned long node_present_pages; /* total number of physical pages */
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
+#ifdef CONFIG_KSTALED
+	unsigned long node_idle_scan_pfn;
+#endif
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b75d41f..b468867 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5736,15 +5736,19 @@ static unsigned kstaled_scan_page(struct page *page)
 	return nr_pages;
 }
 
-static void kstaled_scan_node(pg_data_t *pgdat)
+static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
 {
 	unsigned long flags;
-	unsigned long pfn, end;
+	unsigned long pfn, end, node_end;
 
 	pgdat_resize_lock(pgdat, &flags);
 
 	pfn = pgdat->node_start_pfn;
-	end = pfn + pgdat->node_spanned_pages;
+	node_end = pfn + pgdat->node_spanned_pages;
+	if (!reset && pfn < pgdat->node_idle_scan_pfn)
+		pfn = pgdat->node_idle_scan_pfn;
+	end = min(pfn + DIV_ROUND_UP(pgdat->node_spanned_pages, scan_seconds),
+		  node_end);
 
 	while (pfn < end) {
 		unsigned long contiguous = end;
@@ -5761,8 +5765,8 @@ static void kstaled_scan_node(pg_data_t *pgdat)
 #ifdef CONFIG_MEMORY_HOTPLUG
 				/* abort if the node got resized */
 				if (pfn < pgdat->node_start_pfn ||
-				    end > (pgdat->node_start_pfn +
-					   pgdat->node_spanned_pages))
+				    node_end > (pgdat->node_start_pfn +
+						pgdat->node_spanned_pages))
 					goto abort;
 #endif
 			}
@@ -5774,17 +5778,30 @@ static void kstaled_scan_node(pg_data_t *pgdat)
 
 abort:
 	pgdat_resize_unlock(pgdat, &flags);
+
+	pgdat->node_idle_scan_pfn = min(pfn, end);
+	return pfn >= node_end;
 }
 
 static int kstaled(void *dummy)
 {
+	bool reset = true;
+	long deadline = jiffies;
+
 	while (1) {
 		int scan_seconds;
 		int nid;
-		struct mem_cgroup *memcg;
+		long delta;
+		bool scan_done;
+
+		deadline += HZ;
+		scan_seconds = kstaled_scan_seconds;
+		if (scan_seconds <= 0) {
+			wait_event_interruptible(kstaled_wait,
+				(scan_seconds = kstaled_scan_seconds) > 0);
+			deadline = jiffies + HZ;
+		}
 
-		wait_event_interruptible(kstaled_wait,
-				 (scan_seconds = kstaled_scan_seconds) > 0);
 		/*
 		 * We use interruptible wait_event so as not to contribute
 		 * to the machine load average while we're sleeping.
@@ -5794,21 +5811,35 @@ static int kstaled(void *dummy)
 		 */
 		BUG_ON(scan_seconds <= 0);
 
-		for_each_mem_cgroup_all(memcg)
-			memset(&memcg->idle_scan_stats, 0,
-			       sizeof(memcg->idle_scan_stats));
-
+		scan_done = true;
 		for_each_node_state(nid, N_HIGH_MEMORY)
-			kstaled_scan_node(NODE_DATA(nid));
-
-		for_each_mem_cgroup_all(memcg) {
-			write_seqcount_begin(&memcg->idle_page_stats_lock);
-			memcg->idle_page_stats = memcg->idle_scan_stats;
-			memcg->idle_page_scans++;
-			write_seqcount_end(&memcg->idle_page_stats_lock);
+			scan_done &= kstaled_scan_node(NODE_DATA(nid),
+						       scan_seconds, reset);
+
+		if (scan_done) {
+			struct mem_cgroup *memcg;
+
+			for_each_mem_cgroup_all(memcg) {
+				write_seqcount_begin(
+					&memcg->idle_page_stats_lock);
+				memcg->idle_page_stats =
+					memcg->idle_scan_stats;
+				memcg->idle_page_scans++;
+				write_seqcount_end(
+					&memcg->idle_page_stats_lock);
+				memset(&memcg->idle_scan_stats, 0,
+				       sizeof(memcg->idle_scan_stats));
+			}
 		}
 
-		schedule_timeout_interruptible(scan_seconds * HZ);
+		delta = jiffies - deadline;
+		if (delta < 0)
+			schedule_timeout_interruptible(-delta);
+		else if (delta >= HZ)
+			pr_warning("kstaled running %ld.%02d seconds late\n",
+				   delta / HZ, (int)(delta % HZ) * 100 / HZ);
+
+		reset = scan_done;
 	}
 
 	BUG();
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 7/9] kstaled: add histogram sampling functionality
  2011-09-28  0:48 ` Michel Lespinasse
@ 2011-09-28  0:49   ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

Add statistics for pages that have been idle for 1, 2, 5, 15, 30, 60, 120 or
240 scan intervals into /dev/cgroup/*/memory.idle_page_stats.
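
For illustration only (not part of the patch): each page's idle age, counted
in scan intervals, is mapped to the largest bucket threshold it has reached,
and the reported counters are cumulative across buckets. A standalone toy
program (not kernel code) sketching that mapping with the same thresholds:

#include <stdio.h>

static const int buckets[] = {1, 2, 5, 15, 30, 60, 120, 240};
#define NUM_BUCKETS ((int)(sizeof(buckets) / sizeof(buckets[0])))

/* Return the index of the largest bucket threshold that 'age' has
 * reached, mirroring the bucket selection done by the patch. */
static int age_to_bucket(int age)
{
	int b = 0;

	while (b < NUM_BUCKETS - 1 && age >= buckets[b + 1])
		b++;
	return b;
}

int main(void)
{
	int age;

	/* Ages start at 1 (a page gets age 1 on its first idle scan)
	 * and saturate at 255, as they are kept in a u8 per page. */
	for (age = 1; age <= 255; age++)
		printf("age %3d -> bucket %d (idle for >= %d intervals)\n",
		       age, age_to_bucket(age), buckets[age_to_bucket(age)]);
	return 0;
}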


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mmzone.h |    2 +
 mm/memcontrol.c        |  108 ++++++++++++++++++++++++++++++++++++++----------
 mm/memory_hotplug.c    |    6 +++
 3 files changed, 94 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 272fbed..d8eca1b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -633,6 +633,8 @@ typedef struct pglist_data {
 					     range, including holes */
 #ifdef CONFIG_KSTALED
 	unsigned long node_idle_scan_pfn;
+	u8 *node_idle_page_age;           /* number of scan intervals since
+					     each page was referenced */
 #endif
 	int node_id;
 	wait_queue_head_t kswapd_wait;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b468867..cfe812b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -207,6 +207,11 @@ struct mem_cgroup_eventfd_list {
 static void mem_cgroup_threshold(struct mem_cgroup *mem);
 static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
 
+#ifdef CONFIG_KSTALED
+static const int kstaled_buckets[] = {1, 2, 5, 15, 30, 60, 120, 240};
+#define NUM_KSTALED_BUCKETS ARRAY_SIZE(kstaled_buckets)
+#endif
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -292,7 +297,8 @@ struct mem_cgroup {
 		unsigned long idle_clean;
 		unsigned long idle_dirty_file;
 		unsigned long idle_dirty_swap;
-	} idle_page_stats, idle_scan_stats;
+	} idle_page_stats[NUM_KSTALED_BUCKETS],
+	  idle_scan_stats[NUM_KSTALED_BUCKETS];
 	unsigned long idle_page_scans;
 #endif
 };
@@ -4686,18 +4692,29 @@ static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 	unsigned int seqcount;
-	struct idle_page_stats stats;
+	struct idle_page_stats stats[NUM_KSTALED_BUCKETS];
 	unsigned long scans;
+	int bucket;
 
 	do {
 		seqcount = read_seqcount_begin(&memcg->idle_page_stats_lock);
-		stats = memcg->idle_page_stats;
+		memcpy(stats, memcg->idle_page_stats, sizeof(stats));
 		scans = memcg->idle_page_scans;
 	} while (read_seqcount_retry(&memcg->idle_page_stats_lock, seqcount));
 
-	cb->fill(cb, "idle_clean", stats.idle_clean * PAGE_SIZE);
-	cb->fill(cb, "idle_dirty_file", stats.idle_dirty_file * PAGE_SIZE);
-	cb->fill(cb, "idle_dirty_swap", stats.idle_dirty_swap * PAGE_SIZE);
+	for (bucket = 0; bucket < NUM_KSTALED_BUCKETS; bucket++) {
+		char basename[32], name[32];
+		if (!bucket)
+			sprintf(basename, "idle");
+		else
+			sprintf(basename, "idle_%d", kstaled_buckets[bucket]);
+		sprintf(name, "%s_clean", basename);
+		cb->fill(cb, name, stats[bucket].idle_clean * PAGE_SIZE);
+		sprintf(name, "%s_dirty_file", basename);
+		cb->fill(cb, name, stats[bucket].idle_dirty_file * PAGE_SIZE);
+		sprintf(name, "%s_dirty_swap", basename);
+		cb->fill(cb, name, stats[bucket].idle_dirty_swap * PAGE_SIZE);
+	}
 	cb->fill(cb, "scans", scans);
 
 	return 0;
@@ -5619,12 +5636,25 @@ __setup("swapaccount=", enable_swap_account);
 static unsigned int kstaled_scan_seconds;
 static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
 
-static unsigned kstaled_scan_page(struct page *page)
+static inline struct idle_page_stats *
+kstaled_idle_stats(struct mem_cgroup *memcg, int age)
+{
+	int bucket = 0;
+
+	while (age >= kstaled_buckets[bucket + 1])
+		if (++bucket == NUM_KSTALED_BUCKETS - 1)
+			break;
+	return memcg->idle_scan_stats + bucket;
+}
+
+static unsigned kstaled_scan_page(struct page *page, u8 *idle_page_age)
 {
 	bool is_locked = false;
 	bool is_file;
 	struct page_referenced_info info;
 	struct page_cgroup *pc;
+	struct mem_cgroup *memcg;
+	int age;
 	struct idle_page_stats *stats;
 	unsigned nr_pages;
 
@@ -5704,17 +5734,25 @@ static unsigned kstaled_scan_page(struct page *page)
 
 	/* Find out if the page is idle. Also test for pending mlock. */
 	page_referenced_kstaled(page, is_locked, &info);
-	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED))
+	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED)) {
+		*idle_page_age = 0;
 		goto out;
+	}
 
 	/* Locate kstaled stats for the page's cgroup. */
 	pc = lookup_page_cgroup(page);
 	if (!pc)
 		goto out;
 	lock_page_cgroup(pc);
+	memcg = pc->mem_cgroup;
 	if (!PageCgroupUsed(pc))
 		goto unlock_page_cgroup_out;
-	stats = &pc->mem_cgroup->idle_scan_stats;
+
+	/* Page is idle, increment its age and get the right stats bucket */
+	age = *idle_page_age;
+	if (age < 255)
+		*idle_page_age = ++age;
+	stats = kstaled_idle_stats(memcg, age);
 
 	/* Finally increment the correct statistic for this page. */
 	if (!(info.pr_flags & PR_DIRTY) &&
@@ -5740,11 +5778,22 @@ static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
 {
 	unsigned long flags;
 	unsigned long pfn, end, node_end;
+	u8 *idle_page_age;
 
 	pgdat_resize_lock(pgdat, &flags);
 
+	if (!pgdat->node_idle_page_age) {
+		pgdat->node_idle_page_age = vmalloc(pgdat->node_spanned_pages);
+		if (!pgdat->node_idle_page_age) {
+			pgdat_resize_unlock(pgdat, &flags);
+			return false;
+		}
+		memset(pgdat->node_idle_page_age, 0, pgdat->node_spanned_pages);
+	}
+
 	pfn = pgdat->node_start_pfn;
 	node_end = pfn + pgdat->node_spanned_pages;
+	idle_page_age = pgdat->node_idle_page_age - pfn;
 	if (!reset && pfn < pgdat->node_idle_scan_pfn)
 		pfn = pgdat->node_idle_scan_pfn;
 	end = min(pfn + DIV_ROUND_UP(pgdat->node_spanned_pages, scan_seconds),
@@ -5766,13 +5815,15 @@ static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
 				/* abort if the node got resized */
 				if (pfn < pgdat->node_start_pfn ||
 				    node_end > (pgdat->node_start_pfn +
-						pgdat->node_spanned_pages))
+						pgdat->node_spanned_pages) ||
+				    !pgdat->node_idle_page_age)
 					goto abort;
 #endif
 			}
 
 			pfn += pfn_valid(pfn) ?
-				kstaled_scan_page(pfn_to_page(pfn)) : 1;
+				kstaled_scan_page(pfn_to_page(pfn),
+						  idle_page_age + pfn) : 1;
 		}
 	}
 
@@ -5783,6 +5834,28 @@ abort:
 	return pfn >= node_end;
 }
 
+static void kstaled_update_stats(struct mem_cgroup *memcg)
+{
+	struct idle_page_stats tot;
+	int i;
+
+	memset(&tot, 0, sizeof(tot));
+
+	write_seqcount_begin(&memcg->idle_page_stats_lock);
+	for (i = NUM_KSTALED_BUCKETS - 1; i >= 0; i--) {
+		struct idle_page_stats *idle_scan_bucket;
+		idle_scan_bucket = memcg->idle_scan_stats + i;
+		tot.idle_clean      += idle_scan_bucket->idle_clean;
+		tot.idle_dirty_file += idle_scan_bucket->idle_dirty_file;
+		tot.idle_dirty_swap += idle_scan_bucket->idle_dirty_swap;
+		memcg->idle_page_stats[i] = tot;
+	}
+	memcg->idle_page_scans++;
+	write_seqcount_end(&memcg->idle_page_stats_lock);
+
+	memset(&memcg->idle_scan_stats, 0, sizeof(memcg->idle_scan_stats));
+}
+
 static int kstaled(void *dummy)
 {
 	bool reset = true;
@@ -5819,17 +5892,8 @@ static int kstaled(void *dummy)
 		if (scan_done) {
 			struct mem_cgroup *memcg;
 
-			for_each_mem_cgroup_all(memcg) {
-				write_seqcount_begin(
-					&memcg->idle_page_stats_lock);
-				memcg->idle_page_stats =
-					memcg->idle_scan_stats;
-				memcg->idle_page_scans++;
-				write_seqcount_end(
-					&memcg->idle_page_stats_lock);
-				memset(&memcg->idle_scan_stats, 0,
-				       sizeof(memcg->idle_scan_stats));
-			}
+			for_each_mem_cgroup_all(memcg)
+				kstaled_update_stats(memcg);
 		}
 
 		delta = jiffies - deadline;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c46887b..0b490ac 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -211,6 +211,12 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 
 	pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
 					pgdat->node_start_pfn;
+#ifdef CONFIG_KSTALED
+	if (pgdat->node_idle_page_age) {
+		vfree(pgdat->node_idle_page_age);
+		pgdat->node_idle_page_age = NULL;
+	}
+#endif
 }
 
 static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 8/9] kstaled: add incrementally updating stale page count
  2011-09-28  0:48 ` Michel Lespinasse
@ 2011-09-28  0:49   ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

Add an incrementally updating stale page count. A new per-cgroup
memory.stale_page_age file is introduced. After a non-zero number of scan
cycles is written there, pages that have been idle for at least that number
of cycles and are currently clean are reported in memory.idle_page_stats
as being stale. Unlike the idle_*_clean statistics, this stale page
count is continually updated - hooks have been added to notice pages being
accessed or rendered unevictable, at which point the stale page count for
that cgroup is instantly decremented. The point is to allow userspace to
quickly respond to increased memory pressure.
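
For illustration only (not part of the patch): the intended consumer is a
userspace monitor that polls memory.idle_page_stats and acts once the stale
byte count crosses some threshold. A minimal sketch, assuming the memory
cgroup hierarchy is mounted under /dev/cgroup; the group name "example" and
the 64 MB threshold are made up for the example:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical path; depends on where the memory cgroup is mounted. */
#define STATS_FILE "/dev/cgroup/example/memory.idle_page_stats"

/* Return the "stale" value (in bytes) from memory.idle_page_stats,
 * or -1 if the field is missing or the file cannot be read. */
static long long read_stale_bytes(void)
{
	char key[64];
	long long val, stale = -1;
	FILE *f = fopen(STATS_FILE, "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %lld", key, &val) == 2)
		if (!strcmp(key, "stale"))
			stale = val;
	fclose(f);
	return stale;
}

int main(void)
{
	const long long threshold = 64LL << 20;	/* 64 MB of stale pages */

	for (;;) {
		long long stale = read_stale_bytes();

		if (stale >= threshold)
			printf("%lld stale bytes - container could be shrunk\n",
			       stale);
		sleep(1);
	}
	return 0;
}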


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/page-flags.h |   15 ++++++++
 include/linux/pagemap.h    |   11 ++++--
 mm/internal.h              |    1 +
 mm/memcontrol.c            |   86 ++++++++++++++++++++++++++++++++++++++++++--
 mm/mlock.c                 |    1 +
 mm/vmscan.c                |    2 +-
 6 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e964d98..22dbe90 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -58,6 +58,8 @@
  *
  * PG_idle indicates that the page has not been referenced since the last time
  * kstaled scanned it.
+ *
+ * PG_stale indicates that the page is currently counted as stale.
  */
 
 /*
@@ -117,6 +119,7 @@ enum pageflags {
 #ifdef CONFIG_KSTALED
 	PG_young,		/* kstaled cleared pte_young */
 	PG_idle,		/* idle since start of kstaled interval */
+	PG_stale,		/* page is counted as stale */
 #endif
 	__NR_PAGEFLAGS,
 
@@ -293,21 +296,33 @@ PAGEFLAG_FALSE(HWPoison)
 
 PAGEFLAG(Young, young)
 PAGEFLAG(Idle, idle)
+PAGEFLAG(Stale, stale) TESTSCFLAG(Stale, stale)
+
+void __set_page_nonstale(struct page *page);
+
+static inline void set_page_nonstale(struct page *page)
+{
+	if (PageStale(page))
+		__set_page_nonstale(page);
+}
 
 static inline void set_page_young(struct page *page)
 {
+	set_page_nonstale(page);
 	if (!PageYoung(page))
 		SetPageYoung(page);
 }
 
 static inline void clear_page_idle(struct page *page)
 {
+	set_page_nonstale(page);
 	if (PageIdle(page))
 		ClearPageIdle(page);
 }
 
 #else /* !CONFIG_KSTALED */
 
+static inline void set_page_nonstale(struct page *page) {}
 static inline void set_page_young(struct page *page) {}
 static inline void clear_page_idle(struct page *page) {}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716875e..693dd20 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -46,11 +46,14 @@ static inline void mapping_clear_unevictable(struct address_space *mapping)
 	clear_bit(AS_UNEVICTABLE, &mapping->flags);
 }
 
-static inline int mapping_unevictable(struct address_space *mapping)
+static inline int mapping_unevictable(struct address_space *mapping,
+				      struct page *page)
 {
-	if (mapping)
-		return test_bit(AS_UNEVICTABLE, &mapping->flags);
-	return !!mapping;
+	if (mapping && test_bit(AS_UNEVICTABLE, &mapping->flags)) {
+		set_page_nonstale(page);
+		return 1;
+	}
+	return 0;
 }
 
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
diff --git a/mm/internal.h b/mm/internal.h
index d071d38..d1cb0d6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -93,6 +93,7 @@ static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
 		return 0;
 
 	if (!TestSetPageMlocked(page)) {
+		set_page_nonstale(page);
 		inc_zone_page_state(page, NR_MLOCK);
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cfe812b..5140add 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -292,6 +292,8 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 
 #ifdef CONFIG_KSTALED
+	int stale_page_age;
+
 	seqcount_t idle_page_stats_lock;
 	struct idle_page_stats {
 		unsigned long idle_clean;
@@ -299,6 +301,7 @@ struct mem_cgroup {
 		unsigned long idle_dirty_swap;
 	} idle_page_stats[NUM_KSTALED_BUCKETS],
 	  idle_scan_stats[NUM_KSTALED_BUCKETS];
+	atomic_long_t stale_pages;
 	unsigned long idle_page_scans;
 #endif
 };
@@ -2639,6 +2642,13 @@ static int mem_cgroup_move_account(struct page *page,
 		preempt_enable();
 	}
 	mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
+
+#ifdef CONFIG_KSTALED
+	/* Count page as non-stale */
+	if (PageStale(page) && TestClearPageStale(page))
+		atomic_long_dec(&from->stale_pages);
+#endif
+
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
 		__mem_cgroup_cancel_charge(from, nr_pages);
@@ -3067,6 +3077,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 
 	mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), -nr_pages);
 
+#ifdef CONFIG_KSTALED
+	/* Count page as non-stale */
+	if (PageStale(page) && TestClearPageStale(page))
+		atomic_long_dec(&mem->stale_pages);
+#endif
+
 	ClearPageCgroupUsed(pc);
 	/*
 	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
@@ -4716,6 +4732,29 @@ static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
 		cb->fill(cb, name, stats[bucket].idle_dirty_swap * PAGE_SIZE);
 	}
 	cb->fill(cb, "scans", scans);
+	cb->fill(cb, "stale",
+		 max(atomic_long_read(&memcg->stale_pages), 0L) * PAGE_SIZE);
+
+	return 0;
+}
+
+static u64 mem_cgroup_stale_page_age_read(struct cgroup *cgrp,
+					  struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return memcg->stale_page_age;
+}
+
+static int mem_cgroup_stale_page_age_write(struct cgroup *cgrp,
+					   struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	if (val > 255)
+		return -EINVAL;
+
+	memcg->stale_page_age = val;
 
 	return 0;
 }
@@ -4796,6 +4835,11 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "idle_page_stats",
 		.read_map = mem_cgroup_idle_page_stats_read,
 	},
+	{
+		.name = "stale_page_age",
+		.read_u64 = mem_cgroup_stale_page_age_read,
+		.write_u64 = mem_cgroup_stale_page_age_write,
+	},
 #endif
 };
 
@@ -5721,7 +5765,7 @@ static unsigned kstaled_scan_page(struct page *page, u8 *idle_page_age)
 		 */
 		if (!mapping && !PageSwapCache(page))
 			goto out;
-		else if (mapping_unevictable(mapping))
+		else if (mapping_unevictable(mapping, page))
 			goto out;
 		else if (PageSwapCache(page) ||
 			 mapping_cap_swap_backed(mapping))
@@ -5756,13 +5800,27 @@ static unsigned kstaled_scan_page(struct page *page, u8 *idle_page_age)
 
 	/* Finally increment the correct statistic for this page. */
 	if (!(info.pr_flags & PR_DIRTY) &&
-	    !PageDirty(page) && !PageWriteback(page))
+	    !PageDirty(page) && !PageWriteback(page)) {
 		stats->idle_clean += nr_pages;
-	else if (is_file)
+
+		/* THP pages are currently always accounted for as dirty... */
+		VM_BUG_ON(nr_pages != 1);
+		if (memcg->stale_page_age && age >= memcg->stale_page_age) {
+			if (!PageStale(page) && !TestSetPageStale(page))
+				atomic_long_inc(&memcg->stale_pages);
+			goto unlock_page_cgroup_out;
+		}
+	} else if (is_file)
 		stats->idle_dirty_file += nr_pages;
 	else
 		stats->idle_dirty_swap += nr_pages;
 
+	/* Count page as non-stale */
+	if (PageStale(page) && TestClearPageStale(page)) {
+		VM_BUG_ON(nr_pages != 1);
+		atomic_long_dec(&memcg->stale_pages);
+	}
+
  unlock_page_cgroup_out:
 	unlock_page_cgroup(pc);
 
@@ -5774,6 +5832,28 @@ static unsigned kstaled_scan_page(struct page *page, u8 *idle_page_age)
 	return nr_pages;
 }
 
+void __set_page_nonstale(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg;
+
+	/* Locate kstaled stats for the page's cgroup. */
+	pc = lookup_page_cgroup(page);
+	if (!pc)
+		return;
+	lock_page_cgroup(pc);
+	memcg = pc->mem_cgroup;
+	if (!PageCgroupUsed(pc))
+		goto out;
+
+	/* Count page as non-stale */
+	if (TestClearPageStale(page))
+		atomic_long_dec(&memcg->stale_pages);
+
+out:
+	unlock_page_cgroup(pc);
+}
+
 static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
 {
 	unsigned long flags;
diff --git a/mm/mlock.c b/mm/mlock.c
index 048260c..eac4c32 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -81,6 +81,7 @@ void mlock_vma_page(struct page *page)
 	BUG_ON(!PageLocked(page));
 
 	if (!TestSetPageMlocked(page)) {
+		set_page_nonstale(page);
 		inc_zone_page_state(page, NR_MLOCK);
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 		if (!isolate_lru_page(page))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f0a8a1d..1fefc73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3203,7 +3203,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 int page_evictable(struct page *page, struct vm_area_struct *vma)
 {
 
-	if (mapping_unevictable(page_mapping(page)))
+	if (mapping_unevictable(page_mapping(page), page))
 		return 0;
 
 	if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 9/9] kstaled: export PG_stale in /proc/kpageflags
  2011-09-28  0:48 ` Michel Lespinasse
@ 2011-09-28  0:49   ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  0:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra
  Cc: Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf


Signed-off-by: Michel Lespinasse <walken@google.com>
---
 fs/proc/page.c                    |    4 ++++
 include/linux/kernel-page-flags.h |    2 ++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6d8e6a9..8c3f105 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -159,6 +159,10 @@ u64 stable_page_flags(struct page *page)
 	u |= kpf_copy_bit(k, KPF_OWNER_PRIVATE,	PG_owner_priv_1);
 	u |= kpf_copy_bit(k, KPF_ARCH,		PG_arch_1);
 
+#ifdef CONFIG_KSTALED
+	u |= kpf_copy_bit(k, KPF_STALE,         PG_stale);
+#endif
+
 	return u;
 };
 
diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
index bd92a89..f64acb3 100644
--- a/include/linux/kernel-page-flags.h
+++ b/include/linux/kernel-page-flags.h
@@ -31,6 +31,8 @@
 
 #define KPF_KSM			21
 
+#define KPF_STALE		22
+
 /* kernel hacking assistances
  * WARNING: subject to change, never rely on them!
  */
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/9] page_referenced: replace vm_flags parameter with struct page_referenced_info
  2011-09-28  0:48   ` Michel Lespinasse
@ 2011-09-28  6:28     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  6:28 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Tue, 27 Sep 2011 17:48:59 -0700
Michel Lespinasse <walken@google.com> wrote:

> Introduce struct page_referenced_info, passed into page_referenced() family
> of functions, to represent information about the pte references that have
> been found for that page. Currently contains the vm_flags information as
> well as a PR_REFERENCED flag. The idea is to make it easy to extend the API
> with new flags.
> 
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/9] kstaled: documentation and config option.
  2011-09-28  0:49   ` Michel Lespinasse
@ 2011-09-28  6:53     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  6:53 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Tue, 27 Sep 2011 17:49:00 -0700
Michel Lespinasse <walken@google.com> wrote:

> Extend memory cgroup documentation to describe the optional idle page
> tracking features, and add the corresponding configuration option.
> 
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

> +* idle_2_clean, idle_2_dirty_file, idle_2_dirty_swap: same definitions as
> +  above, but for pages that have been untouched for at least two scan cycles.
> +* these fields repeat up to idle_240_clean, idle_240_dirty_file and
> +  idle_240_dirty_swap, allowing one to observe idle pages over a variety
> +  of idle interval lengths. Note that the accounting is cumulative:
> +  pages counted as idle for a given interval length are also counted
> +  as idle for smaller interval lengths.

I'm sorry if you've answered already.

Why 240? And does the above mean we have idle_xxx_clean/dirty/... where xxx is 'seq 2 240'?
Isn't that messy? Anyway, idle_1_clean etc. should be provided.

Hmm, I don't like the idea very much...

IIUC, there is no existing kernel interface which exposes a histogram, other than load_avg[].
Is there any other such interface, and what histogram does it provide?
And why is a histogram from the kernel required?

BTW, can't this information be exported via /proc/<pid>/smaps or somewhere?
I guess per-process data will eventually be wanted.


Hm, do you use parameters other than idle_clean for your scheduling?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/9] kstaled: page_referenced_kstaled() and supporting infrastructure.
  2011-09-28  0:49   ` Michel Lespinasse
@ 2011-09-28  7:18     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  7:18 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Tue, 27 Sep 2011 17:49:01 -0700
Michel Lespinasse <walken@google.com> wrote:

> Add a new page_referenced_kstaled() interface. The desired behavior
> is that page_referenced() returns page references since the last
> page_referenced() call, and page_referenced_kstaled() returns page
> references since the last page_referenced_kstaled() call, but they
> are both independent of each other and do not influence each other.
> 
> The following events are counted as kstaled page references:
> - CPU data access to the page (as noticed through pte_young());
> - mark_page_accessed() calls;
> - page being freed / reallocated.
> 
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

2 questions.

What happens when transparent huge pages are split/collapsed?
Can this feature ignore page migration, i.e. should the flags not be copied?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/9] kstaled: minimalistic implementation.
  2011-09-28  0:49   ` Michel Lespinasse
@ 2011-09-28  7:41     ` Peter Zijlstra
  -1 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2011-09-28  7:41 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Andrea Arcangeli,
	Johannes Weiner, KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Tue, 2011-09-27 at 17:49 -0700, Michel Lespinasse wrote:
> +static int kstaled(void *dummy)
> +{
> +       while (1) {

> +       }
> +
> +       BUG();
> +       return 0;       /* NOT REACHED */
> +} 

So if you build with this junk (as I presume distros will), there is no
way to disable it?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/9] kstaled: minimalistic implementation.
  2011-09-28  0:49   ` Michel Lespinasse
@ 2011-09-28  8:00     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  8:00 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Tue, 27 Sep 2011 17:49:02 -0700
Michel Lespinasse <walken@google.com> wrote:


> +static unsigned int kstaled_scan_seconds;
> +static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
> +
> +static unsigned kstaled_scan_page(struct page *page)
> +{
> +	bool is_locked = false;
> +	bool is_file;
> +	struct page_referenced_info info;
> +	struct page_cgroup *pc;
> +	struct idle_page_stats *stats;
> +	unsigned nr_pages;
> +
> +	/*
> +	 * Before taking the page reference, check if the page is
> +	 * a user page which is not obviously unreclaimable
> +	 * (we will do more complete checks later).
> +	 */
> +	if (!PageLRU(page) ||
> +	    (!PageCompound(page) &&
> +	     (PageMlocked(page) ||
> +	      (page->mapping == NULL && !PageSwapCache(page)))))
> +		return 1;

Hmm... if you find a page with PageCompound(page) && !PageLRU(page),
this returns "1". Is that OK, and is there no race with khugepaged?

> +
> +	if (!get_page_unless_zero(page))
> +		return 1;
> +
> +	/* Recheck now that we have the page reference. */
> +	if (unlikely(!PageLRU(page)))
> +		goto out;
> +	nr_pages = 1 << compound_trans_order(page);
> +	if (PageMlocked(page))
> +		goto out;
> +
> +	/*
> +	 * Anon and SwapCache pages can be identified without locking.
> +	 * For all other cases, we need the page locked in order to
> +	 * dereference page->mapping.
> +	 */
> +	if (PageAnon(page) || PageSwapCache(page))
> +		is_file = false;
> +	else if (!trylock_page(page)) {
> +		/*
> +		 * We need to lock the page to dereference the mapping.
> +		 * But don't risk sleeping by calling lock_page().
> +		 * We don't want to stall kstaled, so we conservatively
> +		 * count locked pages as unreclaimable.
> +		 */
> +		goto out;
> +	} else {
> +		struct address_space *mapping = page->mapping;
> +
> +		is_locked = true;
> +
> +		/*
> +		 * The page is still anon - it has been continuously referenced
> +		 * since the prior check.
> +		 */
> +		VM_BUG_ON(PageAnon(page) || mapping != page_rmapping(page));
> +
> +		/*
> +		 * Check the mapping under protection of the page lock.
> +		 * 1. If the page is not swap cache and has no mapping,
> +		 *    shrink_page_list can't do anything with it.
> +		 * 2. If the mapping is unevictable (as in SHM_LOCK segments),
> +		 *    shrink_page_list can't do anything with it.
> +		 * 3. If the page is swap cache or the mapping is swap backed
> +		 *    (as in shmem), consider it a swappable page.
> +		 * 4. If the backing dev has indicated that it does not want
> +		 *    its pages sync'd to disk (as in ramfs), take this as
> +		 *    a hint that its pages are not reclaimable.
> +		 * 5. Otherwise, consider this as a file page reclaimable
> +		 *    through standard pageout.
> +		 */
> +		if (!mapping && !PageSwapCache(page))
> +			goto out;
> +		else if (mapping_unevictable(mapping))
> +			goto out;
> +		else if (PageSwapCache(page) ||
> +			 mapping_cap_swap_backed(mapping))
> +			is_file = false;
> +		else if (!mapping_cap_writeback_dirty(mapping))
> +			goto out;
> +		else
> +			is_file = true;
> +	}
> +
> +	/* Find out if the page is idle. Also test for pending mlock. */
> +	page_referenced_kstaled(page, is_locked, &info);
> +	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED))
> +		goto out;
> +
> +	/* Locate kstaled stats for the page's cgroup. */
> +	pc = lookup_page_cgroup(page);
> +	if (!pc)
> +		goto out;

This !pc check is not required.

> +	lock_page_cgroup(pc);
> +	if (!PageCgroupUsed(pc))
> +		goto unlock_page_cgroup_out;
> +	stats = &pc->mem_cgroup->idle_scan_stats;
> +
> +	/* Finally increment the correct statistic for this page. */
> +	if (!(info.pr_flags & PR_DIRTY) &&
> +	    !PageDirty(page) && !PageWriteback(page))
> +		stats->idle_clean += nr_pages;
> +	else if (is_file)
> +		stats->idle_dirty_file += nr_pages;
> +	else
> +		stats->idle_dirty_swap += nr_pages;
> +
> + unlock_page_cgroup_out:
> +	unlock_page_cgroup(pc);
> +

unlock_page_out:
	unlock_page(page);
out:
	put_page(page);

?

Hm, btw, if you put 'stats' into the per-zone struct of the memcg,
you'd have a chance to get per-node/zone idle stats.
Don't you want that?




> + out:
> +	if (is_locked)
> +		unlock_page(page);
> +	put_page(page);
> +
> +	return nr_pages;
> +}
> +
> +static void kstaled_scan_node(pg_data_t *pgdat)
> +{
> +	unsigned long flags;
> +	unsigned long pfn, end;
> +
> +	pgdat_resize_lock(pgdat, &flags);
> +

pgdat_resize_lock() is an IRQ-disabling spinlock, so
IRQs will be blocked while you scan.

I think lock_memory_hotplug() would be better, and I think it's enough.


> +	pfn = pgdat->node_start_pfn;
> +	end = pfn + pgdat->node_spanned_pages;
> +
> +	while (pfn < end) {
> +		if (need_resched()) {
> +			pgdat_resize_unlock(pgdat, &flags);
> +			cond_resched();
> +			pgdat_resize_lock(pgdat, &flags);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +			/* abort if the node got resized */
> +			if (pfn < pgdat->node_start_pfn ||
> +			    end > (pgdat->node_start_pfn +
> +				   pgdat->node_spanned_pages))
> +				goto abort;
> +#endif
> +		}
> +
> +		pfn += pfn_valid(pfn) ?
> +			kstaled_scan_page(pfn_to_page(pfn)) : 1;

There are servers which have the following node layout:

pfn 0 <---------------------------------------------> max_pfn
      node0 node1 node2 node3 node0 node1 node2 node3

Then, you may scan pages multiple times. please check node id.
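
(A minimal way to do that - a hypothetical sketch, skipping pages whose
struct page says they belong to another node:)

		if (pfn_valid(pfn) &&
		    page_to_nid(pfn_to_page(pfn)) == pgdat->node_id)
			pfn += kstaled_scan_page(pfn_to_page(pfn));
		else
			pfn++;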



> +	}
> +
> +abort:
> +	pgdat_resize_unlock(pgdat, &flags);
> +}
> +
> +static int kstaled(void *dummy)
> +{
> +	while (1) {
> +		int scan_seconds;
> +		int nid;
> +		struct mem_cgroup *memcg;
> +
> +		wait_event_interruptible(kstaled_wait,
> +				 (scan_seconds = kstaled_scan_seconds) > 0);
> +		/*
> +		 * We use interruptible wait_event so as not to contribute
> +		 * to the machine load average while we're sleeping.
> +		 * However, we don't actually expect to receive a signal
> +		 * since we run as a kernel thread, so the condition we were
> +		 * waiting for should be true once we get here.
> +		 */
> +		BUG_ON(scan_seconds <= 0);
> +
> +		for_each_mem_cgroup_all(memcg)
> +			memset(&memcg->idle_scan_stats, 0,
> +			       sizeof(memcg->idle_scan_stats));
> +
> +		for_each_node_state(nid, N_HIGH_MEMORY)
> +			kstaled_scan_node(NODE_DATA(nid));
> +
> +		for_each_mem_cgroup_all(memcg) {
> +			write_seqcount_begin(&memcg->idle_page_stats_lock);
> +			memcg->idle_page_stats = memcg->idle_scan_stats;
> +			memcg->idle_page_scans++;
> +			write_seqcount_end(&memcg->idle_page_stats_lock);
> +		}
> +
> +		schedule_timeout_interruptible(scan_seconds * HZ);

Hm, is a timeout the best trigger?




> +	}
> +
> +	BUG();
> +	return 0;	/* NOT REACHED */
> +}
> +
> +static ssize_t kstaled_scan_seconds_show(struct kobject *kobj,
> +					 struct kobj_attribute *attr,
> +					 char *buf)
> +{
> +	return sprintf(buf, "%u\n", kstaled_scan_seconds);
> +}
> +
> +static ssize_t kstaled_scan_seconds_store(struct kobject *kobj,
> +					  struct kobj_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long input;
> +
> +	err = kstrtoul(buf, 10, &input);
> +	if (err)
> +		return -EINVAL;
> +	kstaled_scan_seconds = input;
> +	wake_up_interruptible(&kstaled_wait);
> +	return count;
> +}
> +

How should the user calculate the scan interval?
Can't it be selected in a (semi-)automatic way?

Thanks
-Kame


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/9] kstaled: minimalistic implementation.
  2011-09-28  7:41     ` Peter Zijlstra
@ 2011-09-28  8:01       ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  8:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Andrea Arcangeli,
	Johannes Weiner, KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Wed, Sep 28, 2011 at 12:41 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-09-27 at 17:49 -0700, Michel Lespinasse wrote:
>> +static int kstaled(void *dummy)
>> +{
>> +       while (1) {
>
>> +       }
>> +
>> +       BUG();
>> +       return 0;       /* NOT REACHED */
>> +}
>
> So if you build with this junk (as I presume distro's will), there is no
> way to disable it?

There will be a thread, and it'll block in wait_event_interruptible()
until a positive value is written into
/sys/kernel/mm/kstaled/scan_seconds

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/9] kstaled: skip non-RAM regions.
  2011-09-28  0:49   ` Michel Lespinasse
@ 2011-09-28  8:03     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  8:03 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Tue, 27 Sep 2011 17:49:03 -0700
Michel Lespinasse <walken@google.com> wrote:

> Add a pfn_skip_hole function that shrinks the passed input range in order to
> skip over pfn ranges that are known not to be RAM backed. The x86
> implementation achieves this using e820 tables; other architectures
> use a generic no-op implementation.
> 
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

Hm, can't you use walk_system_ram_range() in kernel/resource.c ?
If it's enough, please update it.
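
(A rough sketch of how that might look - hypothetical and untested;
kstaled_scan_ram_range() is a made-up callback name:)

static int kstaled_scan_ram_range(unsigned long start_pfn,
				  unsigned long nr_pages, void *arg)
{
	unsigned long pfn = start_pfn, end = start_pfn + nr_pages;

	while (pfn < end) {
		cond_resched();
		pfn += pfn_valid(pfn) ?
			kstaled_scan_page(pfn_to_page(pfn)) : 1;
	}
	return 0;
}

	/* ... called from the per-node scan instead of walking raw pfns: */
	walk_system_ram_range(pgdat->node_start_pfn,
			      pgdat->node_spanned_pages, NULL,
			      kstaled_scan_ram_range);

That would restrict the walk to regions registered as "System RAM" and make
the x86-only e820 helper unnecessary.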

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 6/9] kstaled: rate limit pages scanned per second.
  2011-09-28  0:49   ` Michel Lespinasse
@ 2011-09-28  8:13     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  8:13 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Tue, 27 Sep 2011 17:49:04 -0700
Michel Lespinasse <walken@google.com> wrote:

> Scan some number of pages from each node every second, instead of trying to
> scan the entire memory at once and being idle for the rest of the configured
> interval.
> 
> In addition to spreading the CPU usage over the entire scanning interval,
> this also reduces the jitter between two consecutive scans of the same page.
> 
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

Does this scan thread need to be a single thread?

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 6/9] kstaled: rate limit pages scanned per second.
  2011-09-28  8:13     ` KAMEZAWA Hiroyuki
@ 2011-09-28  8:19       ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28  8:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Wed, Sep 28, 2011 at 1:13 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 27 Sep 2011 17:49:04 -0700
> Michel Lespinasse <walken@google.com> wrote:
>
>> Scan some number of pages from each node every second, instead of trying to
>> scan the entire memory at once and being idle for the rest of the configured
>> interval.
>>
>> In addition to spreading the CPU usage over the entire scanning interval,
>> this also reduces the jitter between two consecutive scans of the same page.
>>
>>
>> Signed-off-by: Michel Lespinasse <walken@google.com>
>
> Does this scan thread need to be a single thread?

It tends to perform worse if we try making it multithreaded. What
happens is that the scanning threads call page_referenced() a lot, and
if they both try scanning pages that belong to the same file, that
causes the mapping's i_mmap_mutex lock to bounce. The same thing happens
if they try scanning pages that belong to the same anon VMA too.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 7/9] kstaled: add histogram sampling functionality
  2011-09-28  0:49   ` Michel Lespinasse
@ 2011-09-28  8:22     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  8:22 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Tue, 27 Sep 2011 17:49:05 -0700
Michel Lespinasse <walken@google.com> wrote:

> Add statistics for pages that have been idle for 1,2,5,15,30,60,120 or
> 240 scan intervals into /dev/cgroup/*/memory.idle_page_stats
> 
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>
> ---
>  include/linux/mmzone.h |    2 +
>  mm/memcontrol.c        |  108 ++++++++++++++++++++++++++++++++++++++----------
>  mm/memory_hotplug.c    |    6 +++
>  3 files changed, 94 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 272fbed..d8eca1b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -633,6 +633,8 @@ typedef struct pglist_data {
>  					     range, including holes */
>  #ifdef CONFIG_KSTALED
>  	unsigned long node_idle_scan_pfn;
> +	u8 *node_idle_page_age;           /* number of scan intervals since
> +					     each page was referenced */
>  #endif
>  	int node_id;
>  	wait_queue_head_t kswapd_wait;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b468867..cfe812b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -207,6 +207,11 @@ struct mem_cgroup_eventfd_list {
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>  static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
>  
> +#ifdef CONFIG_KSTALED
> +static const int kstaled_buckets[] = {1, 2, 5, 15, 30, 60, 120, 240};
> +#define NUM_KSTALED_BUCKETS ARRAY_SIZE(kstaled_buckets)
> +#endif
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -292,7 +297,8 @@ struct mem_cgroup {
>  		unsigned long idle_clean;
>  		unsigned long idle_dirty_file;
>  		unsigned long idle_dirty_swap;
> -	} idle_page_stats, idle_scan_stats;
> +	} idle_page_stats[NUM_KSTALED_BUCKETS],
> +	  idle_scan_stats[NUM_KSTALED_BUCKETS];
>  	unsigned long idle_page_scans;
>  #endif
>  };
> @@ -4686,18 +4692,29 @@ static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>  	unsigned int seqcount;
> -	struct idle_page_stats stats;
> +	struct idle_page_stats stats[NUM_KSTALED_BUCKETS];
>  	unsigned long scans;
> +	int bucket;
>  
>  	do {
>  		seqcount = read_seqcount_begin(&memcg->idle_page_stats_lock);
> -		stats = memcg->idle_page_stats;
> +		memcpy(stats, memcg->idle_page_stats, sizeof(stats));
>  		scans = memcg->idle_page_scans;
>  	} while (read_seqcount_retry(&memcg->idle_page_stats_lock, seqcount));
>  
> -	cb->fill(cb, "idle_clean", stats.idle_clean * PAGE_SIZE);
> -	cb->fill(cb, "idle_dirty_file", stats.idle_dirty_file * PAGE_SIZE);
> -	cb->fill(cb, "idle_dirty_swap", stats.idle_dirty_swap * PAGE_SIZE);
> +	for (bucket = 0; bucket < NUM_KSTALED_BUCKETS; bucket++) {
> +		char basename[32], name[32];
> +		if (!bucket)
> +			sprintf(basename, "idle");
> +		else
> +			sprintf(basename, "idle_%d", kstaled_buckets[bucket]);
> +		sprintf(name, "%s_clean", basename);
> +		cb->fill(cb, name, stats[bucket].idle_clean * PAGE_SIZE);
> +		sprintf(name, "%s_dirty_file", basename);
> +		cb->fill(cb, name, stats[bucket].idle_dirty_file * PAGE_SIZE);
> +		sprintf(name, "%s_dirty_swap", basename);
> +		cb->fill(cb, name, stats[bucket].idle_dirty_swap * PAGE_SIZE);
> +	}
>  	cb->fill(cb, "scans", scans);
>  
>  	return 0;
> @@ -5619,12 +5636,25 @@ __setup("swapaccount=", enable_swap_account);
>  static unsigned int kstaled_scan_seconds;
>  static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
>  
> -static unsigned kstaled_scan_page(struct page *page)
> +static inline struct idle_page_stats *
> +kstaled_idle_stats(struct mem_cgroup *memcg, int age)
> +{
> +	int bucket = 0;
> +
> +	while (age >= kstaled_buckets[bucket + 1])
> +		if (++bucket == NUM_KSTALED_BUCKETS - 1)
> +			break;
> +	return memcg->idle_scan_stats + bucket;
> +}
> +
> +static unsigned kstaled_scan_page(struct page *page, u8 *idle_page_age)
>  {
>  	bool is_locked = false;
>  	bool is_file;
>  	struct page_referenced_info info;
>  	struct page_cgroup *pc;
> +	struct mem_cgroup *memcg;
> +	int age;
>  	struct idle_page_stats *stats;
>  	unsigned nr_pages;
>  
> @@ -5704,17 +5734,25 @@ static unsigned kstaled_scan_page(struct page *page)
>  
>  	/* Find out if the page is idle. Also test for pending mlock. */
>  	page_referenced_kstaled(page, is_locked, &info);
> -	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED))
> +	if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED)) {
> +		*idle_page_age = 0;
>  		goto out;
> +	}
>  
>  	/* Locate kstaled stats for the page's cgroup. */
>  	pc = lookup_page_cgroup(page);
>  	if (!pc)
>  		goto out;
>  	lock_page_cgroup(pc);
> +	memcg = pc->mem_cgroup;
>  	if (!PageCgroupUsed(pc))
>  		goto unlock_page_cgroup_out;
> -	stats = &pc->mem_cgroup->idle_scan_stats;
> +
> +	/* Page is idle, increment its age and get the right stats bucket */
> +	age = *idle_page_age;
> +	if (age < 255)
> +		*idle_page_age = ++age;
> +	stats = kstaled_idle_stats(memcg, age);
>  
>  	/* Finally increment the correct statistic for this page. */
>  	if (!(info.pr_flags & PR_DIRTY) &&
> @@ -5740,11 +5778,22 @@ static bool kstaled_scan_node(pg_data_t *pgdat, int scan_seconds, bool reset)
>  {
>  	unsigned long flags;
>  	unsigned long pfn, end, node_end;
> +	u8 *idle_page_age;
>  
>  	pgdat_resize_lock(pgdat, &flags);
>  
> +	if (!pgdat->node_idle_page_age) {
> +		pgdat->node_idle_page_age = vmalloc(pgdat->node_spanned_pages);

Hmm, on a 2TB host this requires

   2 * 1024 * 1024 * 1024 * 1024 / 4096 = 512M entries, i.e. at least 512MB,
and the array will also span huge memory holes ;)

Can't you use some calculation like load_avg or similar instead?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 6/9] kstaled: rate limit pages scanned per second.
  2011-09-28  8:19       ` Michel Lespinasse
@ 2011-09-28  8:59         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-28  8:59 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Wed, 28 Sep 2011 01:19:50 -0700
Michel Lespinasse <walken@google.com> wrote:

> On Wed, Sep 28, 2011 at 1:13 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 27 Sep 2011 17:49:04 -0700
> > Michel Lespinasse <walken@google.com> wrote:
> >
> >> Scan some number of pages from each node every second, instead of trying to
> >> scan the entire memory at once and being idle for the rest of the configured
> >> interval.
> >>
> >> In addition to spreading the CPU usage over the entire scanning interval,
> >> this also reduces the jitter between two consecutive scans of the same page.
> >>
> >>
> >> Signed-off-by: Michel Lespinasse <walken@google.com>
> >
> > Does this scan thread need to be a single thread?
> 
> It tends to perform worse if we try making it multithreaded. What
> happens is that the scanning threads call page_referenced() a lot, and
> if they both try scanning pages that belong to the same file, that
> causes the mapping's i_mmap_mutex lock to bounce. The same thing happens
> if they try scanning pages that belong to the same anon VMA too.
> 

Hmm, thinking about it briefly: if you scan the list of page tables
directly, you can check and clear the young flags without taking those
locks. For inode pages, you could hook the page lookup path, I think.

Then you would only need to clear the Young flag while scanning
[pfn, end_pfn], and that could be multi-threaded, no?
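
(One possible reading of that, as a hypothetical, untested sketch; it
ignores the mmap_sem / pte-lock details and huge pmds, and a real version
would also need a way to record which pfns were found referenced:)

static void kstaled_clear_young_vma(struct vm_area_struct *vma)
{
	unsigned long addr;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
		pud_t *pud;
		pmd_t *pmd;
		pte_t *pte;

		if (pgd_none_or_clear_bad(pgd))
			continue;
		pud = pud_offset(pgd, addr);
		if (pud_none_or_clear_bad(pud))
			continue;
		pmd = pmd_offset(pud, addr);
		if (pmd_none(*pmd) || pmd_trans_huge(*pmd))
			continue;
		pte = pte_offset_map(pmd, addr);
		if (pte_present(*pte) &&
		    ptep_test_and_clear_young(vma, addr, pte)) {
			/* page was referenced since the last scan; a real
			   implementation would record pte_pfn(*pte) here */
		}
		pte_unmap(pte);
	}
}

Walking the page tables this way only touches the scanned mm's page tables,
so several such walkers should not contend on i_mmap_mutex or the anon_vma
locks.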


Thanks,
-Kame




^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/9] kstaled: minimalistic implementation.
  2011-09-28  8:01       ` Michel Lespinasse
@ 2011-09-28 10:26         ` Peter Zijlstra
  -1 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2011-09-28 10:26 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Andrea Arcangeli,
	Johannes Weiner, KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Wed, 2011-09-28 at 01:01 -0700, Michel Lespinasse wrote:
> On Wed, Sep 28, 2011 at 12:41 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Tue, 2011-09-27 at 17:49 -0700, Michel Lespinasse wrote:
> >> +static int kstaled(void *dummy)
> >> +{
> >> +       while (1) {
> >
> >> +       }
> >> +
> >> +       BUG();
> >> +       return 0;       /* NOT REACHED */
> >> +}
> >
> > So if you build with this junk (as I presume distro's will), there is no
> > way to disable it?
> 
> There will be a thread, and it'll block in wait_event_interruptible()
> until a positive value is written into
> /sys/kernel/mm/kstaled/scan_seconds

And here I thought people wanted fewer pointless kernel threads..

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/9] kstaled: documentation and config option.
  2011-09-28  6:53     ` KAMEZAWA Hiroyuki
@ 2011-09-28 23:48       ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-28 23:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Tue, Sep 27, 2011 at 11:53 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 27 Sep 2011 17:49:00 -0700
> Michel Lespinasse <walken@google.com> wrote:
>> +* idle_2_clean, idle_2_dirty_file, idle_2_dirty_swap: same definitions as
>> +  above, but for pages that have been untouched for at least two scan cycles.
>> +* these fields repeat up to idle_240_clean, idle_240_dirty_file and
>> +  idle_240_dirty_swap, allowing one to observe idle pages over a variety
>> +  of idle interval lengths. Note that the accounting is cumulative:
>> +  pages counted as idle for a given interval length are also counted
>> +  as idle for smaller interval lengths.
>
> I'm sorry if you've answered already.
>
> Why 240 ? and above means we have idle_xxx_clean/dirty/ xxx is 'seq 2 240' ?
> Isn't it messy ? Anyway, idle_1_clean etc should be provided.

We don't have all values - we export values for 1, 2, 5, 15, 30, 60,
120 and 240 idle scan intervals.
In our production setup, the scan interval is set at 120 seconds.
The exported histogram values are chosen so that each is approximately
double the previous, and they align with human units, i.e. 30 scan
intervals == 1 hour.
We use one byte per page to track the number of idle cycles, which is
why we don't export anything over 255 scan intervals
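
(For concreteness - assuming the 120-second scan interval mentioned above -
the exported buckets map roughly to wall-clock ages as:)

  bucket (scans):   1    2    5     15    30   60   120  240
  wall clock:       2m   4m   10m   30m   1h   2h   4h   8h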

> Hmm, I don't like the idea very much...
>
> IIUC, there is no kernel interface which shows histgram rather than load_avg[].
> Is there any other interface and what histgram is provided ?
> And why histgram by kernel is required ?

I don't think exporting per-page statistics is very useful given that
userspace doesn't have a way to select individual pages to reclaim
(and if it did, we would have to expose LRU lists to userspace for it
to make good choices, and I don't think we want to go there). So, we
want to expose summary statistics instead. Histograms are a good way
to do that.

I don't think averages would work well for this application - the
distribution of idle page ages varies a lot between applications and
can't be assumed to be even close to a Gaussian.

> BTW, can't this information be exported by /proc/<pid>/smaps or somewhere ?
> I guess per-proc will be wanted finally.

The problem with per-proc is that it only works for things that are
mapped in at the time you look at the report. It does not take into
consideration ephemeral mappings (e.g. something you run every 5 minutes
that needs 1GB of memory) or files you access with
read() instead of mmap().

> Hm, do you use params other than idle_clean for your scheduling ?

The management software currently looks at only one bin of the
histogram - for each job, we can configure which bin it will look at.
Humans look at the complete picture when looking into performance
issues, and we're always thinking about teaching the management
software to do that as well :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/9] kstaled: page_referenced_kstaled() and supporting infrastructure.
  2011-09-28  7:18     ` KAMEZAWA Hiroyuki
@ 2011-09-29  0:09       ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-29  0:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Wed, Sep 28, 2011 at 12:18 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 2 questions.
>
> What happens at Transparent HugeTLB pages are splitted/collapsed ?

Nothing special - at the next scan, pages are counted again
considering their new size.

> Does this feature can ignore page migration i.e. flags should not be copied ?

We're not doing it currently. As I understand, the migrate code does
not copy the PTE young bits either, nor does it try to preserve page
order in the LRU lists. So it's not transparent to the LRU algorithms,
but it does not cause incorrect behavior either.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/9] kstaled: documentation and config option.
  2011-09-28 23:48       ` Michel Lespinasse
@ 2011-09-29  5:40         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-29  5:40 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Wed, 28 Sep 2011 16:48:44 -0700
Michel Lespinasse <walken@google.com> wrote:

> On Tue, Sep 27, 2011 at 11:53 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 27 Sep 2011 17:49:00 -0700
> > Michel Lespinasse <walken@google.com> wrote:
> >> +* idle_2_clean, idle_2_dirty_file, idle_2_dirty_swap: same definitions as
> >> +  above, but for pages that have been untouched for at least two scan cycles.
> >> +* these fields repeat up to idle_240_clean, idle_240_dirty_file and
> >> +  idle_240_dirty_swap, allowing one to observe idle pages over a variety
> >> +  of idle interval lengths. Note that the accounting is cumulative:
> >> +  pages counted as idle for a given interval length are also counted
> >> +  as idle for smaller interval lengths.
> >
> > I'm sorry if you've answered already.
> >
> > Why 240 ? and above means we have idle_xxx_clean/dirty/ xxx is 'seq 2 240' ?
> > Isn't it messy ? Anyway, idle_1_clean etc should be provided.
> 
> We don't have all values - we export values for 1, 2, 5, 15, 30, 60,
> 120 and 240 idle scan intervals.
> In our production setup, the scan interval is set at 120 seconds.
> The exported histogram values are chosen so that each is approximately
> double the previous, and they align with human units, i.e. 30 scan
> intervals == 1 hour.
> We use one byte per page to track the number of idle cycles, which is
> why we don't export anything over 255 scan intervals
> 

If the LRU is divided into 1,2,5,15,30,60,120,240 intervals, OK, I think
having these statistics in the kernel means something.
Do you have any plan to use the aging value for global LRU scheduling?


BTW, how about producing the 'aging' and 'histogram' data on demand?

Right now you do all the scanning in one thread and do the aging with a
counter. But having
   - one scan thread per interval
   - a bitmap (for PG_young, PG_idle) allocated per scan thread
would allow arbitrary scan_interval/histogram combinations and avoid
keeping unnecessary data.

Then users can get the histogram they want; they would be able to get
12h or 24h histograms. But each thread would use 2 bits per page.

Off topic:
you allocated the 'aging' array in pgdat. Please allocate it per section
if CONFIG_SPARSEMEM; then you can handle memory hotplug easily.
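
(A rough per-section sketch - hypothetical and untested; kstaled_section_age
and kstaled_page_age() are made-up names, and the pointer table itself costs
a few MB of static data on large-address-space configs:)

#ifdef CONFIG_SPARSEMEM
static u8 *kstaled_section_age[NR_MEM_SECTIONS];

/* Return the age slot for a pfn, allocating the section's array lazily. */
static u8 *kstaled_page_age(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);
	u8 *age = kstaled_section_age[nr];

	if (!age) {
		age = vzalloc(PAGES_PER_SECTION);
		if (!age)
			return NULL;
		kstaled_section_age[nr] = age;
	}
	return age + (pfn & (PAGES_PER_SECTION - 1));
}
#endif

Hot-added sections would then get their array on first scan, and hot-removed
sections could have theirs freed from a memory hotplug notifier.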


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/9] V2: idle page tracking / working set estimation
  2011-09-28  0:48 ` Michel Lespinasse
                   ` (9 preceding siblings ...)
  (?)
@ 2011-09-29 16:43 ` Eric B Munson
  2011-09-29 20:25     ` Michel Lespinasse
  -1 siblings, 1 reply; 67+ messages in thread
From: Eric B Munson @ 2011-09-29 16:43 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

[-- Attachment #1: Type: text/plain, Size: 1933 bytes --]

On Tue, 27 Sep 2011, Michel Lespinasse wrote:

> This is a followup to the prior version of this patchset, which I sent out
> on September 16.
> 
> I have addressed most of the basic feedback I got so far:
> 
> - Renamed struct pr_info -> struct page_referenced_info
> 
> - Config option now depends on 64BIT, as we may not have sufficient
>   free page flags in 32-bit builds
> 
> - Renamed mem -> memcg in kstaled code within memcontrol.c
> 
> - Uninlined kstaled_scan_page
> 
> - Replaced strict_strtoul -> kstrtoul
> 
> - Report PG_stale in /proc/kpageflags
> 
> - Fix accounting of THP pages. Sorry for forgeting to do this in the
>   V1 patchset - to detail the change here, what I had to do was make sure
>   page_referenced() reports THP pages as dirty (as they always are - the
>   dirty bit in the pmd is currently meaningless) and update the minimalistic
>   implementation change to count THP pages as equivalent to 512 small pages.
> 
> - The ugliest parts of patch 6 (rate limit pages scanned per second) have
>   been reworked. If the scanning thread gets delayed, it tries to catch up
>   so as to minimize jitter. If it can't catch up, it would probably be a
>   good idea to increase the scanning interval, but this is left up
>   to userspace.
> 

Michel,

I have been trying to test these patches since yesterday afternoon.  When my
machine is idle, they behave fine.  I started looking at performance, to make
sure they were not a big regression, by testing kernel builds with the scanner
disabled and then enabled (set to 120 seconds).  The scanner-disabled builds
work fine, but with the scanner enabled, the second time I build my kernel it hangs
my machine every time.  Unfortunately, I do not have any more information than
that for you at the moment.  My next step is to try the same tests in qemu to
see if I can get more state information when the kernel hangs.

Eric

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/9] V2: idle page tracking / working set estimation
  2011-09-29 16:43 ` [PATCH 0/9] V2: idle page tracking / working set estimation Eric B Munson
@ 2011-09-29 20:25     ` Michel Lespinasse
  0 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-29 20:25 UTC (permalink / raw)
  To: Eric B Munson
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

On Thu, Sep 29, 2011 at 9:43 AM, Eric B Munson <emunson@mgebm.net> wrote:
> I have been trying to test these patches since yesterday afternoon.  When my
> machine is idle, they behave fine.  I started looking at performance to make
> sure they were not a big regression by testing kernel builds with the scanner
> disabled, and then enabled (set to 120 seconds).  The scanner disabled builds
> work fine, but with the scanner enabled the second time I build my kernel hangs
> my machine every time.  Unfortunately, I do not have any more information than
> that for you at the moment.  My next step is to try the same tests in qemu to
> see if I can get more state information when the kernel hangs.

Could you please send me your .config file? Also, did you apply the
patches on top of straight v3.0, and what is your machine like?

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/9] V2: idle page tracking / working set estimation
  2011-09-29 20:25     ` Michel Lespinasse
@ 2011-09-29 21:18       ` Eric B Munson
  -1 siblings, 0 replies; 67+ messages in thread
From: Eric B Munson @ 2011-09-29 21:18 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

 On Thu, 29 Sep 2011 13:25:00 -0700, Michel Lespinasse wrote:
> On Thu, Sep 29, 2011 at 9:43 AM, Eric B Munson <emunson@mgebm.net> 
> wrote:
>> I have been trying to test these patches since yesterday afternoon. 
>>  When my
>> machine is idle, they behave fine.  I started looking at performance 
>> to make
>> sure they were not a big regression by testing kernel builds with the 
>> scanner
>> disabled, and then enabled (set to 120 seconds).  The scanner 
>> disabled builds
>> work fine, but with the scanner enabled the second time I build my 
>> kernel hangs
>> my machine every time.  Unfortunately, I do not have any more 
>> information than
>> that for you at the moment.  My next step is to try the same tests 
>> in qemu to
>> see if I can get more state information when the kernel hangs.
>
> Could you please send me your .config file ? Also, did you apply the
> patches on top of straight v3.0 and what is your machine like ?
>
> Thanks,


 My .config will come separately to you.  I applied the patches to 
 Linus' master branch as of yesterday.  My machine is a single Xeon 5690 
 with 12G of RAM (do you need more details than that?)

 Thanks,
 Eric

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/9] V2: idle page tracking / working set estimation
  2011-09-29 21:18       ` Eric B Munson
  (?)
@ 2011-09-30 18:19       ` Eric B Munson
  2011-09-30 21:16           ` Michel Lespinasse
  -1 siblings, 1 reply; 67+ messages in thread
From: Eric B Munson @ 2011-09-30 18:19 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

[-- Attachment #1: Type: text/plain, Size: 1842 bytes --]

On Thu, 29 Sep 2011, Eric B Munson wrote:

> On Thu, 29 Sep 2011 13:25:00 -0700, Michel Lespinasse wrote:
> >On Thu, Sep 29, 2011 at 9:43 AM, Eric B Munson <emunson@mgebm.net>
> >wrote:
> >>I have been trying to test these patches since yesterday
> >>afternoon.  When my
> >>machine is idle, they behave fine.  I started looking at
> >>performance to make
> >>sure they were not a big regression by testing kernel builds with
> >>the scanner
> >>disabled, and then enabled (set to 120 seconds).  The scanner
> >>disabled builds
> >>work fine, but with the scanner enabled the second time I build
> >>my kernel hangs
> >>my machine every time.  Unfortunately, I do not have any more
> >>information than
> >>that for you at the moment.  My next step is to try the same
> >>tests in qemu to
> >>see if I can get more state information when the kernel hangs.
> >
> >Could you please send me your .config file ? Also, did you apply the
> >patches on top of straight v3.0 and what is your machine like ?
> >
> >Thanks,
> 
> 
> My .config will come separately to you.  I applied the patches to
> Linus' master branch as of yesterday.  My machine is a single Xeon
> 5690 with 12G of ram (do you need more details than that?)
> 
> Thanks,
> Eric

I am able to recreate this on a second desktop I have here (same model CPU but a
different motherboard, so I am fairly sure it isn't dying hardware).  It looks to me
like a CPU soft-locks and stalls whatever process is active there; most recently that
was Xorg.  The machine lets me log in via ssh for a few minutes, but things like
ps and cat on /proc files start to work and give some output, then hang.
I cannot call reboot, nor can I sync the fs and reboot via SysRq.  My next step
is to set up a netconsole to see if anything comes out in the syslog that I
cannot see.

Eric

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/9] V2: idle page tracking / working set estimation
  2011-09-30 18:19       ` Eric B Munson
@ 2011-09-30 21:16           ` Michel Lespinasse
  0 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-09-30 21:16 UTC (permalink / raw)
  To: Eric B Munson
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

On Fri, Sep 30, 2011 at 11:19 AM, Eric B Munson <emunson@mgebm.net> wrote:
> I am able to recreate on a second desktop I have here (same model CPU but a
> different MB so I am fairly sure it isn't dying hardware).  It looks to me like
> a CPU softlocks and it stalls the process active there, so most recently that
> was XOrg.  The machine lets me login via ssh for a few minutes, but things like
> ps and cat or /proc files will start to work and give some output but hang.
> I cannot call reboot, nor can I sync the fs and reboot via SysRq.  My next step
> is to setup a netconsole to see if anything comes out in the syslog that I
> cannot see.

I haven't had time to try to reproduce this locally yet (apologies - things
keep coming up).

But a prime suspect would be a bad interaction with
CONFIG_MEMORY_HOTPLUG, as Kamezawa remarked in his reply to patch 4. I
think this is the most likely cause of what you're observing.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/9] V2: idle page tracking / working set estimation
  2011-09-30 21:16           ` Michel Lespinasse
@ 2011-09-30 21:40             ` Eric B Munson
  -1 siblings, 0 replies; 67+ messages in thread
From: Eric B Munson @ 2011-09-30 21:40 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

 On Fri, 30 Sep 2011 14:16:25 -0700, Michel Lespinasse wrote:
> On Fri, Sep 30, 2011 at 11:19 AM, Eric B Munson <emunson@mgebm.net> 
> wrote:
>> I am able to recreate on a second desktop I have here (same model 
>> CPU but a
>> different MB so I am fairly sure it isn't dying hardware).  It looks 
>> to me like
>> a CPU softlocks and it stalls the process active there, so most 
>> recently that
>> was XOrg.  The machine lets me login via ssh for a few minutes, but 
>> things like
>> ps and cat or /proc files will start to work and give some output 
>> but hang.
>> I cannot call reboot, nor can I sync the fs and reboot via SysRq. 
>>  My next step
>> is to setup a netconsole to see if anything comes out in the syslog 
>> that I
>> cannot see.
>
> I haven't had time to try & reproduce locally yet (apologies - things
> have been coming up at me).
>
> But a prime suspect would be a bad interaction with
> CONFIG_MEMORY_HOTPLUG, as Kamezawa remarked in his reply to patch 4. 
> I
> think this could be the most likely cause of what you're observing.

 I will try disabling Memory Hotplug on Monday and let you know if that 
 helps.

 Eric

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/9] V2: idle page tracking / working set estimation
  2011-09-30 21:16           ` Michel Lespinasse
  (?)
  (?)
@ 2011-10-03 15:06           ` Eric B Munson
  -1 siblings, 0 replies; 67+ messages in thread
From: Eric B Munson @ 2011-10-03 15:06 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

[-- Attachment #1: Type: text/plain, Size: 1196 bytes --]

On Fri, 30 Sep 2011, Michel Lespinasse wrote:

> On Fri, Sep 30, 2011 at 11:19 AM, Eric B Munson <emunson@mgebm.net> wrote:
> > I am able to recreate on a second desktop I have here (same model CPU but a
> > different MB so I am fairly sure it isn't dying hardware).  It looks to me like
> > a CPU softlocks and it stalls the process active there, so most recently that
> > was XOrg.  The machine lets me login via ssh for a few minutes, but things like
> > ps and cat or /proc files will start to work and give some output but hang.
> > I cannot call reboot, nor can I sync the fs and reboot via SysRq.  My next step
> > is to setup a netconsole to see if anything comes out in the syslog that I
> > cannot see.
> 
> I haven't had time to try & reproduce locally yet (apologies - things
> have been coming up at me).
> 
> But a prime suspect would be a bad interaction with
> CONFIG_MEMORY_HOTPLUG, as Kamezawa remarked in his reply to patch 4. I
> think this could be the most likely cause of what you're observing.
> 

CONFIG_MEMORY_HOTPLUG seems to be the responsible party here; I disabled it in
my config and have been able to build 6 kernels straight without a hang.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 6/9] kstaled: rate limit pages scanned per second.
  2011-09-28  8:59         ` KAMEZAWA Hiroyuki
@ 2011-10-14  1:25           ` Michel Lespinasse
  -1 siblings, 0 replies; 67+ messages in thread
From: Michel Lespinasse @ 2011-10-14  1:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Wed, Sep 28, 2011 at 1:59 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 28 Sep 2011 01:19:50 -0700
> Michel Lespinasse <walken@google.com> wrote:
>> It tends to perform worse if we try making it multithreaded. What
>> happens is that the scanning threads call page_referenced() a lot, and
>> if they both try scanning pages that belong to the same file that
>> causes the mapping's i_mmap_mutex lock to bounce. The same thing happens
>> if they try scanning pages that belong to the same anon VMA too.
>>
>
> Hmm. with brief thinking, if you can scan list of page tables,
> you can set young flags without any locks.
> For inode pages, you can hook page lookup, I think.

It would be possible to avoid taking rmap locks by instead scanning
all page tables, and transferring the pte young bits observed there to
the PageYoung page flag. This is a significant design change, but
would indeed work.
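
As a rough illustration of that transfer (not from the patchset: the walker
below is hypothetical, and SetPageYoung() is assumed to be whatever helper
the patchset's young page flag provides), scanning one pmd's worth of ptes
could look like:

#include <linux/mm.h>

static void xfer_young_bits(struct vm_area_struct *vma, pmd_t *pmd,
                            unsigned long addr, unsigned long end)
{
        pte_t *pte;
        spinlock_t *ptl;

        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
        for (; addr != end; pte++, addr += PAGE_SIZE) {
                struct page *page;

                if (!pte_present(*pte))
                        continue;
                /* harvest and clear the hardware accessed bit */
                if (!ptep_test_and_clear_young(vma, addr, pte))
                        continue;
                page = vm_normal_page(vma, addr, *pte);
                if (page)
                        SetPageYoung(page);     /* assumed kstaled helper */
        }
        pte_unmap_unlock(pte - 1, ptl);
}

How to enumerate the page tables to feed such a walker is exactly the question
below.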

Just to clarify the idea, how would you go about finding all the page
tables to scan? The most straightforward approach would be to iterate
over all processes and scan their address spaces, but I don't think we
can afford to hold tasklist_lock (even for reads) for that long, so we'd
have to be a bit smarter than that... I can think of a few different
ways, but I'd like to know if you have something specific in mind
first.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 6/9] kstaled: rate limit pages scanned per second.
  2011-10-14  1:25           ` Michel Lespinasse
@ 2011-10-14  4:54             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 67+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-10-14  4:54 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dave Hansen, Rik van Riel,
	Balbir Singh, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	KOSAKI Motohiro, Hugh Dickins, Michael Wolf

On Thu, 13 Oct 2011 18:25:06 -0700
Michel Lespinasse <walken@google.com> wrote:

> On Wed, Sep 28, 2011 at 1:59 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Wed, 28 Sep 2011 01:19:50 -0700
> > Michel Lespinasse <walken@google.com> wrote:
> >> It tends to perform worse if we try making it multithreaded. What
> >> happens is that the scanning threads call page_referenced() a lot, and
> >> if they both try scanning pages that belong to the same file that
> >> causes the mapping's i_mmap_mutex lock to bounce. The same thing happens
> >> if they try scanning pages that belong to the same anon VMA too.
> >>
> >
> > Hmm. with brief thinking, if you can scan list of page tables,
> > you can set young flags without any locks.
> > For inode pages, you can hook page lookup, I think.
> 
> It would be possible to avoid taking rmap locks by instead scanning
> all page tables, and transferring the pte young bits observed there to
> the PageYoung page flag. This is a significant design change, but
> would indeed work.
> 
> Just to clarify the idea, how would you go about finding all page
> tables to scan ? The most straightforward approach would be iterate
> over all processes and scan their address spaces, but I don't think we
> can afford to hold tasklist_lock (even for reads) for so long, so we'd
> have to be a bit smarter than that... I can think of a few different
> ways but I'd like to know if you have something specific in mind
> first.

Maybe there are several ideas.

1. How about chasing the "pgd" kmem_cache?
   I'm not sure, but on x86 it seems all pgds are linked into pgd_list.
   It's not an RCU list now, but making it an RCU list isn't hard.
   Note: IIUC, the struct page for a pgd contains a pointer to its mm_struct.

2. Track dup_mm and do_exec: insert hooks and maintain a list of mm_structs
   (it doesn't need to be implemented as an actual list).

3. Like pgd_list, add some flag to pgd pages. Then you can scan the memmap,
   find the 'pgd' pages, and walk into the page table tree.

Hmm?
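
For idea 1, a rough sketch of the walk on x86 (illustrative only: it assumes
pgd_list had been made safe for the scanner to traverse, and kstaled_walk_mm()
is a hypothetical stand-in for the actual page table walker):

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

static void kstaled_scan_all_pgds(void)
{
        struct page *page;

        /* on x86 every pgd page is linked into pgd_list through page->lru */
        spin_lock(&pgd_lock);           /* the suggestion above: make this RCU */
        list_for_each_entry(page, &pgd_list, lru) {
                /* the struct page of a pgd carries a pointer back to its mm */
                struct mm_struct *mm = pgd_page_get_mm(page);
                pgd_t *pgd = (pgd_t *)page_address(page);

                kstaled_walk_mm(mm, pgd);       /* hypothetical walker */
        }
        spin_unlock(&pgd_lock);
}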

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/9] kstaled: minimalistic implementation.
  2011-09-28  0:49   ` Michel Lespinasse
@ 2012-02-20  9:17     ` Zhu Yanhai
  -1 siblings, 0 replies; 67+ messages in thread
From: Zhu Yanhai @ 2012-02-20  9:17 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 14719 bytes --]

2011/9/28 Michel Lespinasse <walken@google.com>:
> Introduce minimal kstaled implementation. The scan rate is controlled by
> /sys/kernel/mm/kstaled/scan_seconds and per-cgroup statistics are output
> into /dev/cgroup/*/memory.idle_page_stats.
>
>
> Signed-off-by: Michel Lespinasse <walken@google.com>
> ---
>  mm/memcontrol.c |  297 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 297 insertions(+), 0 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..e55056f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -49,6 +49,8 @@
>  #include <linux/page_cgroup.h>
>  #include <linux/cpu.h>
>  #include <linux/oom.h>
> +#include <linux/kthread.h>
> +#include <linux/rmap.h>
>  #include "internal.h"
>
>  #include <asm/uaccess.h>
> @@ -283,6 +285,16 @@ struct mem_cgroup {
>         */
>        struct mem_cgroup_stat_cpu nocpu_base;
>        spinlock_t pcp_counter_lock;
> +
> +#ifdef CONFIG_KSTALED
> +       seqcount_t idle_page_stats_lock;
> +       struct idle_page_stats {
> +               unsigned long idle_clean;
> +               unsigned long idle_dirty_file;
> +               unsigned long idle_dirty_swap;
> +       } idle_page_stats, idle_scan_stats;
> +       unsigned long idle_page_scans;
> +#endif
>  };
>
>  /* Stuffs for move charges at task migration. */
> @@ -4668,6 +4680,30 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
>  }
>  #endif /* CONFIG_NUMA */
>
> +#ifdef CONFIG_KSTALED
> +static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
> +       struct cftype *cft,  struct cgroup_map_cb *cb)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       unsigned int seqcount;
> +       struct idle_page_stats stats;
> +       unsigned long scans;
> +
> +       do {
> +               seqcount = read_seqcount_begin(&memcg->idle_page_stats_lock);
> +               stats = memcg->idle_page_stats;
> +               scans = memcg->idle_page_scans;
> +       } while (read_seqcount_retry(&memcg->idle_page_stats_lock, seqcount));
> +
> +       cb->fill(cb, "idle_clean", stats.idle_clean * PAGE_SIZE);
> +       cb->fill(cb, "idle_dirty_file", stats.idle_dirty_file * PAGE_SIZE);
> +       cb->fill(cb, "idle_dirty_swap", stats.idle_dirty_swap * PAGE_SIZE);
> +       cb->fill(cb, "scans", scans);
> +
> +       return 0;
> +}
> +#endif /* CONFIG_KSTALED */
> +
>  static struct cftype mem_cgroup_files[] = {
>        {
>                .name = "usage_in_bytes",
> @@ -4738,6 +4774,12 @@ static struct cftype mem_cgroup_files[] = {
>                .mode = S_IRUGO,
>        },
>  #endif
> +#ifdef CONFIG_KSTALED
> +       {
> +               .name = "idle_page_stats",
> +               .read_map = mem_cgroup_idle_page_stats_read,
> +       },
> +#endif
>  };
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -5001,6 +5043,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> +#ifdef CONFIG_KSTALED
> +       seqcount_init(&mem->idle_page_stats_lock);
> +#endif
>        return &mem->css;
>  free_out:
>        __mem_cgroup_free(mem);
> @@ -5568,3 +5613,255 @@ static int __init enable_swap_account(char *s)
>  __setup("swapaccount=", enable_swap_account);
>
>  #endif
> +
> +#ifdef CONFIG_KSTALED
> +
> +static unsigned int kstaled_scan_seconds;
> +static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
> +
> +static unsigned kstaled_scan_page(struct page *page)
> +{
> +       bool is_locked = false;
> +       bool is_file;
> +       struct page_referenced_info info;
> +       struct page_cgroup *pc;
> +       struct idle_page_stats *stats;
> +       unsigned nr_pages;
> +
> +       /*
> +        * Before taking the page reference, check if the page is
> +        * a user page which is not obviously unreclaimable
> +        * (we will do more complete checks later).
> +        */
> +       if (!PageLRU(page) ||
> +           (!PageCompound(page) &&
> +            (PageMlocked(page) ||
> +             (page->mapping == NULL && !PageSwapCache(page)))))
> +               return 1;
> +
> +       if (!get_page_unless_zero(page))
> +               return 1;
> +
> +       /* Recheck now that we have the page reference. */
> +       if (unlikely(!PageLRU(page)))
> +               goto out;
> +       nr_pages = 1 << compound_trans_order(page);
> +       if (PageMlocked(page))
> +               goto out;
> +
> +       /*
> +        * Anon and SwapCache pages can be identified without locking.
> +        * For all other cases, we need the page locked in order to
> +        * dereference page->mapping.
> +        */
> +       if (PageAnon(page) || PageSwapCache(page))
> +               is_file = false;
> +       else if (!trylock_page(page)) {
> +               /*
> +                * We need to lock the page to dereference the mapping.
> +                * But don't risk sleeping by calling lock_page().
> +                * We don't want to stall kstaled, so we conservatively
> +                * count locked pages as unreclaimable.
> +                */
> +               goto out;
> +       } else {
> +               struct address_space *mapping = page->mapping;
> +
> +               is_locked = true;
> +
> +               /*
> +                * The page is still anon - it has been continuously referenced
> +                * since the prior check.
> +                */
> +               VM_BUG_ON(PageAnon(page) || mapping != page_rmapping(page));
> +
> +               /*
> +                * Check the mapping under protection of the page lock.
> +                * 1. If the page is not swap cache and has no mapping,
> +                *    shrink_page_list can't do anything with it.
> +                * 2. If the mapping is unevictable (as in SHM_LOCK segments),
> +                *    shrink_page_list can't do anything with it.
> +                * 3. If the page is swap cache or the mapping is swap backed
> +                *    (as in shmem), consider it a swappable page.
> +                * 4. If the backing dev has indicated that it does not want
> +                *    its pages sync'd to disk (as in ramfs), take this as
> +                *    a hint that its pages are not reclaimable.
> +                * 5. Otherwise, consider this as a file page reclaimable
> +                *    through standard pageout.
> +                */
> +               if (!mapping && !PageSwapCache(page))
> +                       goto out;
> +               else if (mapping_unevictable(mapping))
> +                       goto out;
> +               else if (PageSwapCache(page) ||
> +                        mapping_cap_swap_backed(mapping))
> +                       is_file = false;
> +               else if (!mapping_cap_writeback_dirty(mapping))
> +                       goto out;
> +               else
> +                       is_file = true;
> +       }
> +
> +       /* Find out if the page is idle. Also test for pending mlock. */
> +       page_referenced_kstaled(page, is_locked, &info);
> +       if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED))
> +               goto out;
> +
> +       /* Locate kstaled stats for the page's cgroup. */
> +       pc = lookup_page_cgroup(page);
> +       if (!pc)
> +               goto out;
> +       lock_page_cgroup(pc);
> +       if (!PageCgroupUsed(pc))
> +               goto unlock_page_cgroup_out;
> +       stats = &pc->mem_cgroup->idle_scan_stats;
Is it safe to dereference it like this? I think we need something like:

	struct mem_cgroup *memcg = pc->mem_cgroup;
	if (!memcg || !css_tryget(&memcg->css))
		goto out;

and also a css_put() somewhere below.
Or simply remove the lock_page_cgroup() above and use
try_get_mem_cgroup_from_page() directly.
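
Spelled out, the first suggestion might look like the following sketch, which
simply reuses the labels and variables from the quoted patch (illustrative
only, not a tested change):

	struct mem_cgroup *memcg;	/* declared at the top of kstaled_scan_page() */

	/* Locate kstaled stats for the page's cgroup, pinning the memcg. */
	pc = lookup_page_cgroup(page);
	if (!pc)
		goto out;
	lock_page_cgroup(pc);
	if (!PageCgroupUsed(pc))
		goto unlock_page_cgroup_out;
	memcg = pc->mem_cgroup;
	if (!memcg || !css_tryget(&memcg->css))
		goto unlock_page_cgroup_out;
	stats = &memcg->idle_scan_stats;

	/* ... increment idle_clean / idle_dirty_file / idle_dirty_swap ... */

	css_put(&memcg->css);

 unlock_page_cgroup_out:
	unlock_page_cgroup(pc);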

--
Thanks,
Zhu Yanhai
> +
> +       /* Finally increment the correct statistic for this page. */
> +       if (!(info.pr_flags & PR_DIRTY) &&
> +           !PageDirty(page) && !PageWriteback(page))
> +               stats->idle_clean += nr_pages;
> +       else if (is_file)
> +               stats->idle_dirty_file += nr_pages;
> +       else
> +               stats->idle_dirty_swap += nr_pages;
> +
> + unlock_page_cgroup_out:
> +       unlock_page_cgroup(pc);
> +
> + out:
> +       if (is_locked)
> +               unlock_page(page);
> +       put_page(page);
> +
> +       return nr_pages;
> +}
> +
> +static void kstaled_scan_node(pg_data_t *pgdat)
> +{
> +       unsigned long flags;
> +       unsigned long pfn, end;
> +
> +       pgdat_resize_lock(pgdat, &flags);
> +
> +       pfn = pgdat->node_start_pfn;
> +       end = pfn + pgdat->node_spanned_pages;
> +
> +       while (pfn < end) {
> +               if (need_resched()) {
> +                       pgdat_resize_unlock(pgdat, &flags);
> +                       cond_resched();
> +                       pgdat_resize_lock(pgdat, &flags);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +                       /* abort if the node got resized */
> +                       if (pfn < pgdat->node_start_pfn ||
> +                           end > (pgdat->node_start_pfn +
> +                                  pgdat->node_spanned_pages))
> +                               goto abort;
> +#endif
> +               }
> +
> +               pfn += pfn_valid(pfn) ?
> +                       kstaled_scan_page(pfn_to_page(pfn)) : 1;
> +       }
> +
> +abort:
> +       pgdat_resize_unlock(pgdat, &flags);
> +}
> +
> +static int kstaled(void *dummy)
> +{
> +       while (1) {
> +               int scan_seconds;
> +               int nid;
> +               struct mem_cgroup *memcg;
> +
> +               wait_event_interruptible(kstaled_wait,
> +                                (scan_seconds = kstaled_scan_seconds) > 0);
> +               /*
> +                * We use interruptible wait_event so as not to contribute
> +                * to the machine load average while we're sleeping.
> +                * However, we don't actually expect to receive a signal
> +                * since we run as a kernel thread, so the condition we were
> +                * waiting for should be true once we get here.
> +                */
> +               BUG_ON(scan_seconds <= 0);
> +
> +               for_each_mem_cgroup_all(memcg)
> +                       memset(&memcg->idle_scan_stats, 0,
> +                              sizeof(memcg->idle_scan_stats));
> +
> +               for_each_node_state(nid, N_HIGH_MEMORY)
> +                       kstaled_scan_node(NODE_DATA(nid));
> +
> +               for_each_mem_cgroup_all(memcg) {
> +                       write_seqcount_begin(&memcg->idle_page_stats_lock);
> +                       memcg->idle_page_stats = memcg->idle_scan_stats;
> +                       memcg->idle_page_scans++;
> +                       write_seqcount_end(&memcg->idle_page_stats_lock);
> +               }
> +
> +               schedule_timeout_interruptible(scan_seconds * HZ);
> +       }
> +
> +       BUG();
> +       return 0;       /* NOT REACHED */
> +}
> +
> +static ssize_t kstaled_scan_seconds_show(struct kobject *kobj,
> +                                        struct kobj_attribute *attr,
> +                                        char *buf)
> +{
> +       return sprintf(buf, "%u\n", kstaled_scan_seconds);
> +}
> +
> +static ssize_t kstaled_scan_seconds_store(struct kobject *kobj,
> +                                         struct kobj_attribute *attr,
> +                                         const char *buf, size_t count)
> +{
> +       int err;
> +       unsigned long input;
> +
> +       err = kstrtoul(buf, 10, &input);
> +       if (err)
> +               return -EINVAL;
> +       kstaled_scan_seconds = input;
> +       wake_up_interruptible(&kstaled_wait);
> +       return count;
> +}
> +
> +static struct kobj_attribute kstaled_scan_seconds_attr = __ATTR(
> +       scan_seconds, 0644,
> +       kstaled_scan_seconds_show, kstaled_scan_seconds_store);
> +
> +static struct attribute *kstaled_attrs[] = {
> +       &kstaled_scan_seconds_attr.attr,
> +       NULL
> +};
> +static struct attribute_group kstaled_attr_group = {
> +       .name = "kstaled",
> +       .attrs = kstaled_attrs,
> +};
> +
> +static int __init kstaled_init(void)
> +{
> +       int error;
> +       struct task_struct *thread;
> +
> +       error = sysfs_create_group(mm_kobj, &kstaled_attr_group);
> +       if (error) {
> +               pr_err("Failed to create kstaled sysfs node\n");
> +               return error;
> +       }
> +
> +       thread = kthread_run(kstaled, NULL, "kstaled");
> +       if (IS_ERR(thread)) {
> +               pr_err("Failed to start kstaled\n");
> +               return PTR_ERR(thread);
> +       }
> +
> +       return 0;
> +}
> +module_init(kstaled_init);
> +
> +#endif /* CONFIG_KSTALED */
> --
> 1.7.3.1
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/9] kstaled: minimalistic implementation.
@ 2012-02-20  9:17     ` Zhu Yanhai
  0 siblings, 0 replies; 67+ messages in thread
From: Zhu Yanhai @ 2012-02-20  9:17 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Dave Hansen, Rik van Riel, Balbir Singh, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, KOSAKI Motohiro, Hugh Dickins,
	Michael Wolf

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 14707 bytes --]

2011/9/28 Michel Lespinasse <walken@google.com>:
> Introduce minimal kstaled implementation. The scan rate is controlled by
> /sys/kernel/mm/kstaled/scan_seconds and per-cgroup statistics are output
> into /dev/cgroup/*/memory.idle_page_stats.
>
>
> Signed-off-by: Michel Lespinasse <walken@google.com>
> ---
>  mm/memcontrol.c |  297 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 297 insertions(+), 0 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..e55056f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -49,6 +49,8 @@
>  #include <linux/page_cgroup.h>
>  #include <linux/cpu.h>
>  #include <linux/oom.h>
> +#include <linux/kthread.h>
> +#include <linux/rmap.h>
>  #include "internal.h"
>
>  #include <asm/uaccess.h>
> @@ -283,6 +285,16 @@ struct mem_cgroup {
>         */
>        struct mem_cgroup_stat_cpu nocpu_base;
>        spinlock_t pcp_counter_lock;
> +
> +#ifdef CONFIG_KSTALED
> +       seqcount_t idle_page_stats_lock;
> +       struct idle_page_stats {
> +               unsigned long idle_clean;
> +               unsigned long idle_dirty_file;
> +               unsigned long idle_dirty_swap;
> +       } idle_page_stats, idle_scan_stats;
> +       unsigned long idle_page_scans;
> +#endif
>  };
>
>  /* Stuffs for move charges at task migration. */
> @@ -4668,6 +4680,30 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
>  }
>  #endif /* CONFIG_NUMA */
>
> +#ifdef CONFIG_KSTALED
> +static int mem_cgroup_idle_page_stats_read(struct cgroup *cgrp,
> +       struct cftype *cft,  struct cgroup_map_cb *cb)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       unsigned int seqcount;
> +       struct idle_page_stats stats;
> +       unsigned long scans;
> +
> +       do {
> +               seqcount = read_seqcount_begin(&memcg->idle_page_stats_lock);
> +               stats = memcg->idle_page_stats;
> +               scans = memcg->idle_page_scans;
> +       } while (read_seqcount_retry(&memcg->idle_page_stats_lock, seqcount));
> +
> +       cb->fill(cb, "idle_clean", stats.idle_clean * PAGE_SIZE);
> +       cb->fill(cb, "idle_dirty_file", stats.idle_dirty_file * PAGE_SIZE);
> +       cb->fill(cb, "idle_dirty_swap", stats.idle_dirty_swap * PAGE_SIZE);
> +       cb->fill(cb, "scans", scans);
> +
> +       return 0;
> +}
> +#endif /* CONFIG_KSTALED */
> +
>  static struct cftype mem_cgroup_files[] = {
>        {
>                .name = "usage_in_bytes",
> @@ -4738,6 +4774,12 @@ static struct cftype mem_cgroup_files[] = {
>                .mode = S_IRUGO,
>        },
>  #endif
> +#ifdef CONFIG_KSTALED
> +       {
> +               .name = "idle_page_stats",
> +               .read_map = mem_cgroup_idle_page_stats_read,
> +       },
> +#endif
>  };
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -5001,6 +5043,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> +#ifdef CONFIG_KSTALED
> +       seqcount_init(&mem->idle_page_stats_lock);
> +#endif
>        return &mem->css;
>  free_out:
>        __mem_cgroup_free(mem);
> @@ -5568,3 +5613,255 @@ static int __init enable_swap_account(char *s)
>  __setup("swapaccount=", enable_swap_account);
>
>  #endif
> +
> +#ifdef CONFIG_KSTALED
> +
> +static unsigned int kstaled_scan_seconds;
> +static DECLARE_WAIT_QUEUE_HEAD(kstaled_wait);
> +
> +static unsigned kstaled_scan_page(struct page *page)
> +{
> +       bool is_locked = false;
> +       bool is_file;
> +       struct page_referenced_info info;
> +       struct page_cgroup *pc;
> +       struct idle_page_stats *stats;
> +       unsigned nr_pages;
> +
> +       /*
> +        * Before taking the page reference, check if the page is
> +        * a user page which is not obviously unreclaimable
> +        * (we will do more complete checks later).
> +        */
> +       if (!PageLRU(page) ||
> +           (!PageCompound(page) &&
> +            (PageMlocked(page) ||
> +             (page->mapping == NULL && !PageSwapCache(page)))))
> +               return 1;
> +
> +       if (!get_page_unless_zero(page))
> +               return 1;
> +
> +       /* Recheck now that we have the page reference. */
> +       if (unlikely(!PageLRU(page)))
> +               goto out;
> +       nr_pages = 1 << compound_trans_order(page);
> +       if (PageMlocked(page))
> +               goto out;
> +
> +       /*
> +        * Anon and SwapCache pages can be identified without locking.
> +        * For all other cases, we need the page locked in order to
> +        * dereference page->mapping.
> +        */
> +       if (PageAnon(page) || PageSwapCache(page))
> +               is_file = false;
> +       else if (!trylock_page(page)) {
> +               /*
> +                * We need to lock the page to dereference the mapping.
> +                * But don't risk sleeping by calling lock_page().
> +                * We don't want to stall kstaled, so we conservatively
> +                * count locked pages as unreclaimable.
> +                */
> +               goto out;
> +       } else {
> +               struct address_space *mapping = page->mapping;
> +
> +               is_locked = true;
> +
> +               /*
> +                * The page is still anon - it has been continuously referenced
> +                * since the prior check.
> +                */
> +               VM_BUG_ON(PageAnon(page) || mapping != page_rmapping(page));
> +
> +               /*
> +                * Check the mapping under protection of the page lock.
> +                * 1. If the page is not swap cache and has no mapping,
> +                *    shrink_page_list can't do anything with it.
> +                * 2. If the mapping is unevictable (as in SHM_LOCK segments),
> +                *    shrink_page_list can't do anything with it.
> +                * 3. If the page is swap cache or the mapping is swap backed
> +                *    (as in shmem), consider it a swappable page.
> +                * 4. If the backing dev has indicated that it does not want
> +                *    its pages sync'd to disk (as in ramfs), take this as
> +                *    a hint that its pages are not reclaimable.
> +                * 5. Otherwise, consider this as a file page reclaimable
> +                *    through standard pageout.
> +                */
> +               if (!mapping && !PageSwapCache(page))
> +                       goto out;
> +               else if (mapping_unevictable(mapping))
> +                       goto out;
> +               else if (PageSwapCache(page) ||
> +                        mapping_cap_swap_backed(mapping))
> +                       is_file = false;
> +               else if (!mapping_cap_writeback_dirty(mapping))
> +                       goto out;
> +               else
> +                       is_file = true;
> +       }
> +
> +       /* Find out if the page is idle. Also test for pending mlock. */
> +       page_referenced_kstaled(page, is_locked, &info);
> +       if ((info.pr_flags & PR_REFERENCED) || (info.vm_flags & VM_LOCKED))
> +               goto out;
> +
> +       /* Locate kstaled stats for the page's cgroup. */
> +       pc = lookup_page_cgroup(page);
> +       if (!pc)
> +               goto out;
> +       lock_page_cgroup(pc);
> +       if (!PageCgroupUsed(pc))
> +               goto unlock_page_cgroup_out;
> +       stats = &pc->mem_cgroup->idle_scan_stats;
Is it safe to deference it like this? I think we need something like this:
struct mem_cgroup *memcg = pc->mem_cgroup;
if (!memcg || !css_tryget(&memcg->css))
   goto out;
And also css_put() in soewhere bmelow.
Or simply remove the lock_page_cgroup() above and use
try_get_mem_cgroup_from_page() directly.

--
Thanks,
Zhu Yanhai
> +
> +       /* Finally increment the correct statistic for this page. */
> +       if (!(info.pr_flags & PR_DIRTY) &&
> +           !PageDirty(page) && !PageWriteback(page))
> +               stats->idle_clean += nr_pages;
> +       else if (is_file)
> +               stats->idle_dirty_file += nr_pages;
> +       else
> +               stats->idle_dirty_swap += nr_pages;
> +
> + unlock_page_cgroup_out:
> +       unlock_page_cgroup(pc);
> +
> + out:
> +       if (is_locked)
> +               unlock_page(page);
> +       put_page(page);
> +
> +       return nr_pages;
> +}
> +
> +static void kstaled_scan_node(pg_data_t *pgdat)
> +{
> +       unsigned long flags;
> +       unsigned long pfn, end;
> +
> +       pgdat_resize_lock(pgdat, &flags);
> +
> +       pfn = pgdat->node_start_pfn;
> +       end = pfn + pgdat->node_spanned_pages;
> +
> +       while (pfn < end) {
> +               if (need_resched()) {
> +                       pgdat_resize_unlock(pgdat, &flags);
> +                       cond_resched();
> +                       pgdat_resize_lock(pgdat, &flags);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +                       /* abort if the node got resized */
> +                       if (pfn < pgdat->node_start_pfn ||
> +                           end > (pgdat->node_start_pfn +
> +                                  pgdat->node_spanned_pages))
> +                               goto abort;
> +#endif
> +               }
> +
> +               pfn += pfn_valid(pfn) ?
> +                       kstaled_scan_page(pfn_to_page(pfn)) : 1;
> +       }
> +
> +abort:
> +       pgdat_resize_unlock(pgdat, &flags);
> +}
> +
> +static int kstaled(void *dummy)
> +{
> +       while (1) {
> +               int scan_seconds;
> +               int nid;
> +               struct mem_cgroup *memcg;
> +
> +               wait_event_interruptible(kstaled_wait,
> +                                (scan_seconds = kstaled_scan_seconds) > 0);
> +               /*
> +                * We use interruptible wait_event so as not to contribute
> +                * to the machine load average while we're sleeping.
> +                * However, we don't actually expect to receive a signal
> +                * since we run as a kernel thread, so the condition we were
> +                * waiting for should be true once we get here.
> +                */
> +               BUG_ON(scan_seconds <= 0);
> +
> +               for_each_mem_cgroup_all(memcg)
> +                       memset(&memcg->idle_scan_stats, 0,
> +                              sizeof(memcg->idle_scan_stats));
> +
> +               for_each_node_state(nid, N_HIGH_MEMORY)
> +                       kstaled_scan_node(NODE_DATA(nid));
> +
> +               for_each_mem_cgroup_all(memcg) {
> +                       write_seqcount_begin(&memcg->idle_page_stats_lock);
> +                       memcg->idle_page_stats = memcg->idle_scan_stats;
> +                       memcg->idle_page_scans++;
> +                       write_seqcount_end(&memcg->idle_page_stats_lock);
> +               }
> +
> +               schedule_timeout_interruptible(scan_seconds * HZ);
> +       }
> +
> +       BUG();
> +       return 0;       /* NOT REACHED */
> +}
> +
> +static ssize_t kstaled_scan_seconds_show(struct kobject *kobj,
> +                                        struct kobj_attribute *attr,
> +                                        char *buf)
> +{
> +       return sprintf(buf, "%u\n", kstaled_scan_seconds);
> +}
> +
> +static ssize_t kstaled_scan_seconds_store(struct kobject *kobj,
> +                                         struct kobj_attribute *attr,
> +                                         const char *buf, size_t count)
> +{
> +       int err;
> +       unsigned long input;
> +
> +       err = kstrtoul(buf, 10, &input);
> +       if (err)
> +               return -EINVAL;
> +       kstaled_scan_seconds = input;
> +       wake_up_interruptible(&kstaled_wait);
> +       return count;
> +}
> +
> +static struct kobj_attribute kstaled_scan_seconds_attr = __ATTR(
> +       scan_seconds, 0644,
> +       kstaled_scan_seconds_show, kstaled_scan_seconds_store);
> +
> +static struct attribute *kstaled_attrs[] = {
> +       &kstaled_scan_seconds_attr.attr,
> +       NULL
> +};
> +static struct attribute_group kstaled_attr_group = {
> +       .name = "kstaled",
> +       .attrs = kstaled_attrs,
> +};
> +
> +static int __init kstaled_init(void)
> +{
> +       int error;
> +       struct task_struct *thread;
> +
> +       error = sysfs_create_group(mm_kobj, &kstaled_attr_group);
> +       if (error) {
> +               pr_err("Failed to create kstaled sysfs node\n");
> +               return error;
> +       }
> +
> +       thread = kthread_run(kstaled, NULL, "kstaled");
> +       if (IS_ERR(thread)) {
> +               pr_err("Failed to start kstaled\n");
> +               return PTR_ERR(thread);
> +       }
> +
> +       return 0;
> +}
> +module_init(kstaled_init);
> +
> +#endif /* CONFIG_KSTALED */
> --
> 1.7.3.1
>
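For readers wiring this up: the attribute group above hangs off mm_kobj, so the knob
should surface as /sys/kernel/mm/kstaled/scan_seconds (path inferred from mm_kobj and
the group/attribute names; it is not spelled out in this hunk). Below is a minimal
userspace sketch, not part of the patch, using a hypothetical helper
set_kstaled_scan_seconds() to start periodic scanning:

/* Sketch only, not part of the patch: write a decimal interval to the
 * kstaled sysfs knob. The store handler above parses it with kstrtoul()
 * and wakes the kstaled thread. Path assumed from mm_kobj. */
#include <stdio.h>

static int set_kstaled_scan_seconds(unsigned int seconds)
{
	FILE *f = fopen("/sys/kernel/mm/kstaled/scan_seconds", "w");
	int ret;

	if (!f)
		return -1;
	/* Any positive value starts periodic scans; 0 leaves kstaled parked. */
	ret = (fprintf(f, "%u\n", seconds) < 0) ? -1 : 0;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}

int main(void)
{
	if (set_kstaled_scan_seconds(120) != 0)
		perror("kstaled scan_seconds");
	return 0;
}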


end of thread (newest message: 2012-02-20  9:18 UTC)

Thread overview: 67+ messages
2011-09-28  0:48 [PATCH 0/9] V2: idle page tracking / working set estimation Michel Lespinasse
2011-09-28  0:48 ` Michel Lespinasse
2011-09-28  0:48 ` [PATCH 1/9] page_referenced: replace vm_flags parameter with struct page_referenced_info Michel Lespinasse
2011-09-28  0:48   ` Michel Lespinasse
2011-09-28  6:28   ` KAMEZAWA Hiroyuki
2011-09-28  6:28     ` KAMEZAWA Hiroyuki
2011-09-28  0:49 ` [PATCH 2/9] kstaled: documentation and config option Michel Lespinasse
2011-09-28  0:49   ` Michel Lespinasse
2011-09-28  6:53   ` KAMEZAWA Hiroyuki
2011-09-28  6:53     ` KAMEZAWA Hiroyuki
2011-09-28 23:48     ` Michel Lespinasse
2011-09-28 23:48       ` Michel Lespinasse
2011-09-29  5:40       ` KAMEZAWA Hiroyuki
2011-09-29  5:40         ` KAMEZAWA Hiroyuki
2011-09-28  0:49 ` [PATCH 3/9] kstaled: page_referenced_kstaled() and supporting infrastructure Michel Lespinasse
2011-09-28  0:49   ` Michel Lespinasse
2011-09-28  7:18   ` KAMEZAWA Hiroyuki
2011-09-28  7:18     ` KAMEZAWA Hiroyuki
2011-09-29  0:09     ` Michel Lespinasse
2011-09-29  0:09       ` Michel Lespinasse
2011-09-28  0:49 ` [PATCH 4/9] kstaled: minimalistic implementation Michel Lespinasse
2011-09-28  0:49   ` Michel Lespinasse
2011-09-28  7:41   ` Peter Zijlstra
2011-09-28  7:41     ` Peter Zijlstra
2011-09-28  8:01     ` Michel Lespinasse
2011-09-28  8:01       ` Michel Lespinasse
2011-09-28 10:26       ` Peter Zijlstra
2011-09-28 10:26         ` Peter Zijlstra
2011-09-28  8:00   ` KAMEZAWA Hiroyuki
2011-09-28  8:00     ` KAMEZAWA Hiroyuki
2012-02-20  9:17   ` Zhu Yanhai
2012-02-20  9:17     ` Zhu Yanhai
2011-09-28  0:49 ` [PATCH 5/9] kstaled: skip non-RAM regions Michel Lespinasse
2011-09-28  0:49   ` Michel Lespinasse
2011-09-28  8:03   ` KAMEZAWA Hiroyuki
2011-09-28  8:03     ` KAMEZAWA Hiroyuki
2011-09-28  0:49 ` [PATCH 6/9] kstaled: rate limit pages scanned per second Michel Lespinasse
2011-09-28  0:49   ` Michel Lespinasse
2011-09-28  8:13   ` KAMEZAWA Hiroyuki
2011-09-28  8:13     ` KAMEZAWA Hiroyuki
2011-09-28  8:19     ` Michel Lespinasse
2011-09-28  8:19       ` Michel Lespinasse
2011-09-28  8:59       ` KAMEZAWA Hiroyuki
2011-09-28  8:59         ` KAMEZAWA Hiroyuki
2011-10-14  1:25         ` Michel Lespinasse
2011-10-14  1:25           ` Michel Lespinasse
2011-10-14  4:54           ` KAMEZAWA Hiroyuki
2011-10-14  4:54             ` KAMEZAWA Hiroyuki
2011-09-28  0:49 ` [PATCH 7/9] kstaled: add histogram sampling functionality Michel Lespinasse
2011-09-28  0:49   ` Michel Lespinasse
2011-09-28  8:22   ` KAMEZAWA Hiroyuki
2011-09-28  8:22     ` KAMEZAWA Hiroyuki
2011-09-28  0:49 ` [PATCH 8/9] kstaled: add incrementally updating stale page count Michel Lespinasse
2011-09-28  0:49   ` Michel Lespinasse
2011-09-28  0:49 ` [PATCH 9/9] kstaled: export PG_stale in /proc/kpageflags Michel Lespinasse
2011-09-28  0:49   ` Michel Lespinasse
2011-09-29 16:43 ` [PATCH 0/9] V2: idle page tracking / working set estimation Eric B Munson
2011-09-29 20:25   ` Michel Lespinasse
2011-09-29 20:25     ` Michel Lespinasse
2011-09-29 21:18     ` Eric B Munson
2011-09-29 21:18       ` Eric B Munson
2011-09-30 18:19       ` Eric B Munson
2011-09-30 21:16         ` Michel Lespinasse
2011-09-30 21:16           ` Michel Lespinasse
2011-09-30 21:40           ` Eric B Munson
2011-09-30 21:40             ` Eric B Munson
2011-10-03 15:06           ` Eric B Munson
