linux-kernel.vger.kernel.org archive mirror
* [FIX][PATCH 0/3] memcg: 3 fixes for memory cgroup's memory reclaim
@ 2011-06-28  8:31 KAMEZAWA Hiroyuki
  2011-06-28  8:39 ` [PATCH 1/3] memcg: fix reclaimable lru check in memcg KAMEZAWA Hiroyuki
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-28  8:31 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, Michal Hocko, nishimura, linux-kernel


This series contains 3 fixes for memcg in Linus's git tree.
All of them were posted in the last week. I have split them out, refreshed
them, and am posting them here because I think all of the patches have
obvious benefits.

All patches are independent of each other, but you may see
some dependency between 1 and 2.

1/3 .... fix memory cgroup reclaimable check.
2/3 .... fix memory cgroup numascan update by events
3/3 .... fix lock_page() trouble when using memcg.

Because 3/3 changes the behavior of __do_fault(), I'd like
to get review from mm specialists.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/3] memcg: fix reclaimable lru check in memcg.
  2011-06-28  8:31 [FIX][PATCH 0/3] memcg: 3 fixes for memory cgroup's memory reclaim KAMEZAWA Hiroyuki
@ 2011-06-28  8:39 ` KAMEZAWA Hiroyuki
  2011-06-29 13:40   ` Michal Hocko
  2011-06-28  8:41 ` [FIX][PATCH 2/3] memcg: fix numa scan information update to be triggered by memory event KAMEZAWA Hiroyuki
  2011-06-28  8:54 ` [PATCH 3/3] mm: preallocate page before lock_page() at filemap COW KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-28  8:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm, Michal Hocko, nishimura, linux-kernel

From b52bcd09843e903e5f184d0ee499909d072f3c8d Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Tue, 28 Jun 2011 15:45:38 +0900
Subject: [PATCH 1/3] memcg: fix reclaimable lru check in memcg.

Currently, mem_cgroup_hierarchical_reclaim() uses mem_cgroup_local_usage()
to check whether the memcg contains reclaimable pages. If there are no
pages in it, the routine skips it.

But mem_cgroup_local_usage() includes Unevictable pages and cannot handle
the "noswap" condition correctly, so the check does not work on a
swapless system.

This patch adds mem_cgroup_reclaimable(), built on a new per-node helper
test_mem_cgroup_node_reclaimable(), and uses it in place of
mem_cgroup_local_usage(). The new check looks at the LRU counters and
returns a correct answer to the caller. It also takes a "noswap" argument
so that only the file LRUs are considered when necessary.
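
To illustrate the intended behaviour, here is a stand-alone userspace model
of the check (illustration only: the struct and counter names are made up,
and the real implementation is in the hunk below):

  #include <stdbool.h>
  #include <stdio.h>

  /* Toy per-node LRU counters for one memcg (hypothetical names). */
  struct memcg_node_stat {
          unsigned long nr_file;        /* pages on the file LRUs */
          unsigned long nr_anon;        /* pages on the anon LRUs */
          unsigned long nr_unevictable; /* never reclaimable      */
  };

  static unsigned long total_swap_pages; /* 0 => swapless system */

  static bool node_reclaimable(const struct memcg_node_stat *st, bool noswap)
  {
          if (st->nr_file)
                  return true;
          if (noswap || !total_swap_pages)
                  return false; /* anon pages cannot be reclaimed without swap */
          return st->nr_anon != 0;
  }

  int main(void)
  {
          struct memcg_node_stat st = { .nr_file = 0, .nr_anon = 512 };

          /* The old usage-based check (RSS + CACHE != 0) would say "yes" here. */
          printf("reclaimable: %d\n", node_reclaimable(&st, false)); /* prints 0 */
          return 0;
  }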

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changelog:
  - make use of mem_cgroup_node_nr_file_lru_pages()
  - fixed coding style issues.
---
 mm/memcontrol.c |   83 ++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 64 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ddffc74..c624312 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -577,15 +577,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *mem,
 	return val;
 }
 
-static long mem_cgroup_local_usage(struct mem_cgroup *mem)
-{
-	long ret;
-
-	ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
-	ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
-	return ret;
-}
-
 static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 					 bool charge)
 {
@@ -1559,6 +1550,27 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
 	return ret;
 }
 
+/** test_mem_cgroup_node_reclaimable
+ * @mem: the target memcg
+ * @nid: the node ID to be checked.
+ * @noswap: specify true here if the user wants file-only information.
+ *
+ * This function returns whether the specified memcg contains any
+ * reclaimable pages on a node. Returns true if there are any reclaimable
+ * pages in the node.
+ */
+static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *mem,
+		int nid, bool noswap)
+{
+	if (mem_cgroup_node_nr_file_lru_pages(mem, nid))
+		return true;
+	if (noswap || !total_swap_pages)
+		return false;
+	if (mem_cgroup_node_nr_anon_lru_pages(mem, nid))
+		return true;
+	return false;
+
+}
 #if MAX_NUMNODES > 1
 
 /*
@@ -1580,15 +1592,8 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
 
 	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
 
-		if (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_FILE) ||
-		    mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_FILE))
-			continue;
-
-		if (total_swap_pages &&
-		    (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_ANON) ||
-		     mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_ANON)))
-			continue;
-		node_clear(nid, mem->scan_nodes);
+		if (!test_mem_cgroup_node_reclaimable(mem, nid, false))
+			node_clear(nid, mem->scan_nodes);
 	}
 }
 
@@ -1627,11 +1632,51 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
 	return node;
 }
 
+/*
+ * Check all nodes whether they contain reclaimable pages or not.
+ * For a quick scan, we make use of scan_nodes. This will allow us to skip
+ * unused nodes. But scan_nodes is lazily updated and may not contain
+ * enough new information. We need to double check.
+ */
+bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
+{
+	int nid;
+
+	/*
+	 * quick check...making use of scan_node.
+	 * We can skip unused nodes.
+	 */
+	if (!nodes_empty(mem->scan_nodes)) {
+		for (nid = first_node(mem->scan_nodes);
+		     nid < MAX_NUMNODES;
+		     nid = next_node(nid, mem->scan_nodes)) {
+
+			if (test_mem_cgroup_node_reclaimable(mem, nid, noswap))
+				return true;
+		}
+	}
+	/*
+	 * Check rest of nodes.
+	 */
+	for_each_node_state(nid, N_HIGH_MEMORY) {
+		if (node_isset(nid, mem->scan_nodes))
+			continue;
+		if (test_mem_cgroup_node_reclaimable(mem, nid, noswap))
+			return true;
+	}
+	return false;
+}
+
 #else
 int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
 {
 	return 0;
 }
+
+bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
+{
+	return test_mem_cgroup_node_reclaimable(mem, 0, noswap);
+}
 #endif
 
 /*
@@ -1702,7 +1747,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 				}
 			}
 		}
-		if (!mem_cgroup_local_usage(victim)) {
+		if (!mem_cgroup_reclaimable(victim, noswap)) {
 			/* this cgroup's local usage == 0 */
 			css_put(&victim->css);
 			continue;
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [FIX][PATCH 2/3] memcg: fix numa scan information update to be triggered by memory event
  2011-06-28  8:31 [FIX][PATCH 0/3] memcg: 3 fixes for memory cgroup's memory reclaim KAMEZAWA Hiroyuki
  2011-06-28  8:39 ` [PATCH 1/3] memcg: fix reclaimable lru check in memcg KAMEZAWA Hiroyuki
@ 2011-06-28  8:41 ` KAMEZAWA Hiroyuki
  2011-06-29 13:12   ` Michal Hocko
  2011-06-28  8:54 ` [PATCH 3/3] mm: preallocate page before lock_page() at filemap COW KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-28  8:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm, Michal Hocko, nishimura, linux-kernel

From 646ca5cd1e1ab0633892b86a1bbb6cf600d79d58 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Tue, 28 Jun 2011 17:09:25 +0900
Subject: [PATCH 2/3] Fix numa scan information update to be triggered by memory event

Commit 889976 added a NUMA node round-robin for memcg, but the information
is only updated once per 10 seconds.

This patch changes the update trigger from jiffies to the memcg's event
count. After this patch, the NUMA scan information is updated once we have
seen 1024 pagein/pageout events in a memcg.
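
To illustrate the trigger, here is a stand-alone userspace model (a rough
sketch only: the kernel keeps the event counter and targets per-CPU, and the
real code is in the hunks below):

  #include <stdatomic.h>
  #include <stdio.h>

  #define NUMAINFO_EVENTS_TARGET 1024

  static unsigned long events, numainfo_target = NUMAINFO_EVENTS_TARGET;
  static atomic_int numainfo_events;   /* "numa info is stale" counter            */
  static atomic_int numainfo_updating; /* gate: only one rebuilder runs at a time */

  static void check_events(void)       /* called on every pagein/pageout */
  {
          if (++events >= numainfo_target) {
                  atomic_fetch_add(&numainfo_events, 1);
                  numainfo_target = events + NUMAINFO_EVENTS_TARGET;
          }
  }

  static void may_update_nodemask(void) /* called from the reclaim path */
  {
          if (!atomic_load(&numainfo_events))
                  return;         /* nothing changed, keep the old scan_nodes */
          if (atomic_fetch_add(&numainfo_updating, 1) > 0)
                  return;         /* someone else is already rebuilding       */
          /* ... rebuild scan_nodes here ... */
          atomic_store(&numainfo_events, 0);
          atomic_store(&numainfo_updating, 0);
  }

  int main(void)
  {
          for (int i = 0; i < 3000; i++)
                  check_events();
          may_update_nodemask();  /* rebuilds: ~3000 events were seen */
          may_update_nodemask();  /* no-op: the counter was reset     */
          return 0;
  }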

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changelog:
  - simplified
  - removed mutex
  - removed the 3% check. To use heuristics, we cannot avoid magic values,
    so the heuristics were removed.
---
 mm/memcontrol.c |   29 +++++++++++++++++++++++++----
 1 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c624312..3e7d5e6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -108,10 +108,12 @@ enum mem_cgroup_events_index {
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
 	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_TARGET_NUMAINFO,
 	MEM_CGROUP_NTARGETS,
 };
 #define THRESHOLDS_EVENTS_TARGET (128)
 #define SOFTLIMIT_EVENTS_TARGET (1024)
+#define NUMAINFO_EVENTS_TARGET	(1024)
 
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
@@ -237,7 +239,8 @@ struct mem_cgroup {
 	int last_scanned_node;
 #if MAX_NUMNODES > 1
 	nodemask_t	scan_nodes;
-	unsigned long   next_scan_node_update;
+	atomic_t	numainfo_events;
+	atomic_t	numainfo_updating;
 #endif
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
@@ -680,6 +683,9 @@ static void __mem_cgroup_target_update(struct mem_cgroup *mem, int target)
 	case MEM_CGROUP_TARGET_SOFTLIMIT:
 		next = val + SOFTLIMIT_EVENTS_TARGET;
 		break;
+	case MEM_CGROUP_TARGET_NUMAINFO:
+		next = val + NUMAINFO_EVENTS_TARGET;
+		break;
 	default:
 		return;
 	}
@@ -703,6 +709,14 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
+#if MAX_NUMNODES > 1
+		if (unlikely(__memcg_event_check(mem,
+			MEM_CGROUP_TARGET_NUMAINFO))) {
+			atomic_inc(&mem->numainfo_events);
+			__mem_cgroup_target_update(mem,
+				MEM_CGROUP_TARGET_NUMAINFO);
+		}
+#endif
 	}
 }
 
@@ -1582,11 +1596,15 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *mem,
 static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
 {
 	int nid;
-
-	if (time_after(mem->next_scan_node_update, jiffies))
+	/*
+	 * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
+	 * pagein/pageout changes since the last update.
+	 */
+	if (!atomic_read(&mem->numainfo_events))
+		return;
+	if (atomic_inc_return(&mem->numainfo_updating) > 1)
 		return;
 
-	mem->next_scan_node_update = jiffies + 10*HZ;
 	/* make a nodemask where this memcg uses memory from */
 	mem->scan_nodes = node_states[N_HIGH_MEMORY];
 
@@ -1595,6 +1613,9 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
 		if (!test_mem_cgroup_node_reclaimable(mem, nid, false))
 			node_clear(nid, mem->scan_nodes);
 	}
+
+	atomic_set(&mem->numainfo_events, 0);
+	atomic_set(&mem->numainfo_updating, 0);
 }
 
 /*
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/3] mm: preallocate page before lock_page() at filemap COW
  2011-06-28  8:31 [FIX][PATCH 0/3] memcg: 3 fixes for memory cgroup's memory reclaim KAMEZAWA Hiroyuki
  2011-06-28  8:39 ` [PATCH 1/3] memcg: fix reclaimable lru check in memcg KAMEZAWA Hiroyuki
  2011-06-28  8:41 ` [FIX][PATCH 2/3] memcg: fix numa scan information update to be triggered by memory event KAMEZAWA Hiroyuki
@ 2011-06-28  8:54 ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-28  8:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, akpm, Michal Hocko, nishimura, linux-kernel,
	Mel Gorman, Hugh Dickins, kosaki.motohiro

From 32c2ca2c3b24945c7078122b99c3ed89fc84dcc2 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Tue, 28 Jun 2011 17:30:03 +0900
Subject: [PATCH 3/3] mm: preallocate page before lock_page() at filemap COW.

Currently we keep the faulted page locked throughout the whole __do_fault()
call (except for the page_mkwrite code path) after calling the filesystem's
fault code. If we do early COW, we allocate a new page which has to be
charged to a memcg (mem_cgroup_newpage_charge).

This function, however, might block for an unbounded amount of time if the
memcg OOM killer is disabled or a fork-bomb is running, because the only way
out of the OOM situation is either an external event or a fix of the OOM
situation itself.

In the end we keep the faulted page locked and block other processes from
faulting it in, which is not good at all, because we are basically punishing
a potentially unrelated process for an OOM condition in a different group
(I have seen a system get stuck because ld-2.11.1.so was locked).

The problem is easy to reproduce:

 % cgcreate -g memory:A
 % cgset -r memory.limit_in_bytes=64M A
 % cgset -r memory.memsw.limit_in_bytes=64M A
 % cd kernel_dir; cgexec -g memory:A make -j

Then the whole system is live-locked until you kill 'make -j' by hand
(or push reboot...). This is because some important pages of a shared
library are locked.

Thinking about it again, the new page does not need to be allocated with
lock_page() held. Moreover, an ordinary page allocation may dive into a long
memory reclaim loop while holding lock_page() and can cause very long
latency.

There are 3 possible approaches:
  1. do allocation/charge before lock_page()
     Pros. - simple, and page allocation can be handled in the same manner.
             This will reduce the holding time of lock_page() in general.
     Cons. - we do the page allocation even if ->fault() returns an error.

  2. do the charge after unlock_page(). Even if the charge fails, it's just OOM.
     Pros. - no impact on the non-memcg path.
     Cons. - the implementation requires special care with the LRU and we need
             to modify page_add_new_anon_rmap()...

  3. do unlock -> charge -> lock again.
     Pros. - no impact on the non-memcg path.
     Cons. - this may kill the LOCK_PAGE_RETRY optimization. We need to release
             the lock and take it again...

This patch moves the charge and the memory allocation for the COW page
before lock_page(). This way we avoid scanning the LRU while holding a lock
on a page, and the latency under lock_page() is reduced.

With this change, the livelock described above disappears.
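
In short, the ordering for a write fault on a private file mapping changes
like this (a simplified sketch; see the diff below for the real code):

  before:  ->fault()              [vmf.page locked from here on]
           -> anon_vma_prepare() -> alloc_page_vma() -> mem_cgroup_newpage_charge()
              (allocation and memcg charge may block for a long time while
               the faulted page stays locked)

  after:   anon_vma_prepare() -> alloc_page_vma() -> mem_cgroup_newpage_charge()
           -> ->fault()           [vmf.page locked from here on]
           -> copy_user_highpage()
              (only bounded work is done while the faulted page is locked)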


Reported-by: Lutz Vieweg <lvml@5t9.de>
Original-idea-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memory.c |   56 ++++++++++++++++++++++++++++++++++----------------------
 1 files changed, 34 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..7d7d3a2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3103,14 +3103,34 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *page_table;
 	spinlock_t *ptl;
 	struct page *page;
+	struct page *cow_page;
 	pte_t entry;
 	int anon = 0;
-	int charged = 0;
 	struct page *dirty_page = NULL;
 	struct vm_fault vmf;
 	int ret;
 	int page_mkwrite = 0;
 
+	/*
+	 * If we do COW later, allocate the page before taking lock_page()
+	 * on the file cache page. This will reduce the lock holding time.
+	 */
+	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+
+		if (unlikely(anon_vma_prepare(vma)))
+			return VM_FAULT_OOM;
+
+		cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+		if (!cow_page)
+			return VM_FAULT_OOM;
+
+		if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
+			page_cache_release(cow_page);
+			return VM_FAULT_OOM;
+		}
+	} else
+		cow_page = NULL;
+
 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
@@ -3119,12 +3139,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
 			    VM_FAULT_RETRY)))
-		return ret;
+		goto uncharge_out;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
 			unlock_page(vmf.page);
-		return VM_FAULT_HWPOISON;
+		ret = VM_FAULT_HWPOISON;
+		goto uncharge_out;
 	}
 
 	/*
@@ -3142,23 +3163,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = vmf.page;
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
+			page = cow_page;
 			anon = 1;
-			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
-			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
-						vma, address);
-			if (!page) {
-				ret = VM_FAULT_OOM;
-				goto out;
-			}
-			if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
-				ret = VM_FAULT_OOM;
-				page_cache_release(page);
-				goto out;
-			}
-			charged = 1;
 			copy_user_highpage(page, vmf.page, address, vma);
 			__SetPageUptodate(page);
 		} else {
@@ -3227,8 +3233,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* no need to invalidate: a not-present page won't be cached */
 		update_mmu_cache(vma, address, page_table);
 	} else {
-		if (charged)
-			mem_cgroup_uncharge_page(page);
+		if (cow_page)
+			mem_cgroup_uncharge_page(cow_page);
 		if (anon)
 			page_cache_release(page);
 		else
@@ -3237,7 +3243,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	pte_unmap_unlock(page_table, ptl);
 
-out:
 	if (dirty_page) {
 		struct address_space *mapping = page->mapping;
 
@@ -3267,6 +3272,13 @@ out:
 unwritable_page:
 	page_cache_release(page);
 	return ret;
+uncharge_out:
+	/* the fs's fault handler returned an error */
+	if (cow_page) {
+		mem_cgroup_uncharge_page(cow_page);
+		page_cache_release(cow_page);
+	}
+	return ret;
 }
 
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [FIX][PATCH 2/3] memcg: fix numa scan information update to be triggered by memory event
  2011-06-28  8:41 ` [FIX][PATCH 2/3] memcg: fix numa scan information update to be triggered by memory event KAMEZAWA Hiroyuki
@ 2011-06-29 13:12   ` Michal Hocko
  0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2011-06-29 13:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm, nishimura, linux-kernel

On Tue 28-06-11 17:41:50, KAMEZAWA Hiroyuki wrote:
> From 646ca5cd1e1ab0633892b86a1bbb6cf600d79d58 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Tue, 28 Jun 2011 17:09:25 +0900
> Subject: [PATCH 2/3] Fix numa scan information update to be triggered by memory event
> 
> commit 889976 adds an numa node round-robin for memcg. But the information
> is updated once per 10sec.
> 
> This patch changes the update trigger from jiffies to memcg's event count.
> After this patch, numa scan information will be updated when we see
> 1024 events of pagein/pageout under a memcg.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

See the note about wasted memory for MAX_NUMNODES==1 below.

> 
> Changelog:
>   - simplified
>   - removed mutex
>   - removed 3% check. To use heuristics, we cannot avoid magic value.
>     So, removed heuristics.
> ---
>  mm/memcontrol.c |   29 +++++++++++++++++++++++++----
>  1 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c624312..3e7d5e6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -108,10 +108,12 @@ enum mem_cgroup_events_index {
>  enum mem_cgroup_events_target {
>  	MEM_CGROUP_TARGET_THRESH,
>  	MEM_CGROUP_TARGET_SOFTLIMIT,
> +	MEM_CGROUP_TARGET_NUMAINFO,

This still wastes sizeof(unsigned long) of per-CPU space for non-NUMA
machines (resp. MAX_NUMNODES==1).

[...]
> @@ -703,6 +709,14 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
>  			__mem_cgroup_target_update(mem,
>  				MEM_CGROUP_TARGET_SOFTLIMIT);
>  		}
> +#if MAX_NUMNODES > 1
> +		if (unlikely(__memcg_event_check(mem,
> +			MEM_CGROUP_TARGET_NUMAINFO))) {
> +			atomic_inc(&mem->numainfo_events);
> +			__mem_cgroup_target_update(mem,
> +				MEM_CGROUP_TARGET_NUMAINFO);
> +		}
> +#endif
>  	}
>  }
>  
> @@ -1582,11 +1596,15 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *mem,
>  static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
>  {
>  	int nid;
> -
> -	if (time_after(mem->next_scan_node_update, jiffies))
> +	/*
> +	 * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
> +	 * pagein/pageout changes since the last update.
> +	 */
> +	if (!atomic_read(&mem->numainfo_events))
> +		return;

At first I was worried about memory barriers here, because the
atomic_{set,inc} used for numainfo_events do not imply memory barriers, but
that is not a problem: memcg_check_events will always make
numainfo_events > 0 (even if it doesn't see the atomic_set from this
function; we are not interested in the exact value anyway).

> +	if (atomic_inc_return(&mem->numainfo_updating) > 1)
>  		return;

OK, this one should be barrier safe as well, since atomic_inc_return
enforces a barrier on both sides (before and after the operation), so the
atomic_set shouldn't break it AFAIU.

>  
> -	mem->next_scan_node_update = jiffies + 10*HZ;
>  	/* make a nodemask where this memcg uses memory from */
>  	mem->scan_nodes = node_states[N_HIGH_MEMORY];
>  
> @@ -1595,6 +1613,9 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
>  		if (!test_mem_cgroup_node_reclaimable(mem, nid, false))
>  			node_clear(nid, mem->scan_nodes);
>  	}
> +
> +	atomic_set(&mem->numainfo_events, 0);
> +	atomic_set(&mem->numainfo_updating, 0);
>  }
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/3] memcg: fix reclaimable lru check in memcg.
  2011-06-28  8:39 ` [PATCH 1/3] memcg: fix reclaimable lru check in memcg KAMEZAWA Hiroyuki
@ 2011-06-29 13:40   ` Michal Hocko
  0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2011-06-29 13:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm, nishimura, linux-kernel

On Tue 28-06-11 17:39:58, KAMEZAWA Hiroyuki wrote:
> From b52bcd09843e903e5f184d0ee499909d072f3c8d Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Tue, 28 Jun 2011 15:45:38 +0900
> Subject: [PATCH 1/3] memcg: fix reclaimable lru check in memcg.
> 
> Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage()
> is used for checking whether the memcg contains reclaimable pages
> or not. If no pages in it, the routine skips it.
> 
> But, mem_cgroup_local_usage() contains Unevictable pages and cannot
> handle "noswap" condition correctly. This doesn't work on a swapless
> system.
> 
> This patch adds test_mem_cgroup_reclaimable() and replaces
> mem_cgroup_local_usage(). test_mem_cgroup_reclaimable() see LRU
> counter and returns correct answer to the caller.
> And this new function has "noswap" argument and can see only
> FILE LRU if necessary.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

except for the !CONFIG_NUMA issue - see below.

> @@ -1559,6 +1550,27 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
>  	return ret;
>  }
>  
> +/** test_mem_cgroup_node_reclaimable
> + * @mem: the target memcg
> + * @nid: the node ID to be checked.
> + * @noswap : specify true here if the user wants flle only information.
> + *
> + * This function returns whether the specified memcg contains any
> + * reclaimable pages on a node. Returns true if there are any reclaimable
> + * pages in the node.
> + */
> +static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *mem,
> +		int nid, bool noswap)
> +{
> +	if (mem_cgroup_node_nr_file_lru_pages(mem, nid))
> +		return true;

I do not see a definition of mem_cgroup_node_nr_file_lru_pages for
MAX_NUMNODES==1 (resp. !CONFIG_NUMA), and you are calling this function
from that context as well.

>  #if MAX_NUMNODES > 1
>  
>  /*
> @@ -1580,15 +1592,8 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *mem)
>  
>  	for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
>  
> -		if (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_FILE) ||
> -		    mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_FILE))
> -			continue;
> -
> -		if (total_swap_pages &&
> -		    (mem_cgroup_get_zonestat_node(mem, nid, LRU_INACTIVE_ANON) ||
> -		     mem_cgroup_get_zonestat_node(mem, nid, LRU_ACTIVE_ANON)))
> -			continue;
> -		node_clear(nid, mem->scan_nodes);
> +		if (!test_mem_cgroup_node_reclaimable(mem, nid, false))
> +			node_clear(nid, mem->scan_nodes);

Nice clean up.

> @@ -1627,11 +1632,51 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
>  	return node;
>  }
>  
> +/*
> + * Check all nodes whether it contains reclaimable pages or not.
> + * For quick scan, we make use of scan_nodes. This will allow us to skip
> + * unused nodes. But scan_nodes is lazily updated and may not cotain
> + * enough new information. We need to do double check.
> + */
> +bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
> +{
> +	int nid;
> +
> +	/*
> +	 * quick check...making use of scan_node.
> +	 * We can skip unused nodes.
> +	 */
> +	if (!nodes_empty(mem->scan_nodes)) {
> +		for (nid = first_node(mem->scan_nodes);
> +		     nid < MAX_NUMNODES;
> +		     nid = next_node(nid, mem->scan_nodes)) {
> +
> +			if (test_mem_cgroup_node_reclaimable(mem, nid, noswap))
> +				return true;
> +		}
> +	}
> +	/*
> + 	 * Check rest of nodes.
> + 	 */
> +	for_each_node_state(nid, N_HIGH_MEMORY) {
> +		if (node_isset(nid, mem->scan_nodes))
> +			continue;
> +		if (test_mem_cgroup_node_reclaimable(mem, nid, noswap))
> +			return true;	
> +	}
> +	return false;
> +}
> +
>  #else

This is the #else branch taken when MAX_NUMNODES == 1, AFAICS.

>  int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
>  {
>  	return 0;
>  }
> +
> +bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap)
> +{
> +	return test_mem_cgroup_node_reclaimable(mem, 0, noswap);
> +}
>  #endif

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-06-29 13:41 UTC | newest]

Thread overview: 6+ messages
-- links below jump to the message on this page --
2011-06-28  8:31 [FIX][PATCH 0/3] memcg: 3 fixes for memory cgroup's memory reclaim KAMEZAWA Hiroyuki
2011-06-28  8:39 ` [PATCH 1/3] memcg: fix reclaimable lru check in memcg KAMEZAWA Hiroyuki
2011-06-29 13:40   ` Michal Hocko
2011-06-28  8:41 ` [FIX][PATCH 2/3] memcg: fix numa scan information update to be triggered by memory event KAMEZAWA Hiroyuki
2011-06-29 13:12   ` Michal Hocko
2011-06-28  8:54 ` [PATCH 3/3] mm: preallocate page before lock_page() at filemap COW KAMEZAWA Hiroyuki
