* [PATCH 0/3] fix stale swap cache account leak  in memcg v7
@ 2009-05-12  1:44 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-12  1:44 UTC (permalink / raw)
  To: linux-mm; +Cc: balbir, nishimura, akpm, mingo, linux-kernel

I hope this version gets acks..
==
As Nishimura reported, there is a race in handling swap cache.

Typical cases are as follows (from Nishimura's mail):


== Type-1 ==
  If some pages of processA have been swapped out, it calls free_swap_and_cache().
  And if, at the same time, processB is calling read_swap_cache_async() on
  a swap entry *that is used by processA*, a race like the one below can happen.

            processA                   |           processB
  -------------------------------------+-------------------------------------
    (free_swap_and_cache())            |  (read_swap_cache_async())
                                       |    swap_duplicate()
                                       |    __set_page_locked()
                                       |    add_to_swap_cache()
      swap_entry_free() == 0           |
      find_get_page() -> found         |
      try_lock_page() -> fail & return |
                                       |    lru_cache_add_anon()
                                       |      doesn't link this page to memcg's
                                       |      LRU, because of !PageCgroupUsed.

  This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.


== Type-2 ==
    Assume processA is exiting and a pte points to a page (!PageSwapCache),
    and processB is trying to reclaim that page.

              processA                   |           processB
    -------------------------------------+-------------------------------------
      (page_remove_rmap())               |  (shrink_page_list())
         mem_cgroup_uncharge_page()      |
            ->uncharged because it's not |
              PageSwapCache yet.         |
              So, both mem/memsw.usage   |
              are decremented.           |
                                         |    add_to_swap() -> added to swap cache.

    If this page gets through shrink_page_list() without being freed for some
    reason, it doesn't go back to memcg's LRU because of !PageCgroupUsed.


Considering Type-1, it's better to avoid swapin-readahead when memcg is used.
swapin-readahead just reads swp_entries which are near the requested entry, so
pages that will never be used can end up in memory (on the global LRU). When
memcg is used, this is not good behavior anyway.

Considering Type-2, the page should be freed from SwapCache right after writeback.
Freeing swapped-out pages as soon as possible is good behavior for memcg anyway.

The patch set includes the following:
 [1/3] add the mem_cgroup_activated() function, which tells us memcg is _really_ used.
 [2/3] fix the swap cache handling race by avoiding readahead.
 [3/3] fix the swap cache handling race by checking swapcount again.

Results are good under my test.

Thanks,
-Kame


* [PATCH 1/3] add check for mem cgroup is activated
  2009-05-12  1:44 ` KAMEZAWA Hiroyuki
@ 2009-05-12  1:45   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-12  1:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, akpm, mingo, linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

There is a function "mem_cgroup_disabled()" which answers
  - is memcg configured?
  - is it disabled by boot option?
This check is useful to confirm whether we have to call memcg's hooks or not.

But even when memcg is configured (and not disabled), it's not really used
until it is mounted. This patch adds mem_cgroup_activated() to check whether
memcg has been mounted at least once.
(Will be used in a later patch.)

IIUC, only very careful users set the boot option to disable memcg, and
most people will not be aware that memcg is enabled by default.
So, when memcg wants to affect global VM behavior or add some overhead,
there are cases where this check is better than mem_cgroup_disabled().
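
(Illustration only, not part of this patch set: a sketch of how a memcg-only
slow path could be gated on this check. memcg_wants_extra_work() is a made-up
caller-side helper name; mem_cgroup_disabled() and mem_cgroup_activated() are
the checks discussed above.)

	/*
	 * Sketch: do memcg-specific extra work only when the controller is
	 * compiled in, not disabled on the command line, and really mounted.
	 */
	static inline bool memcg_wants_extra_work(void)
	{
		if (mem_cgroup_disabled())
			return false;	/* not configured, or disabled by boot option */
		return mem_cgroup_activated();	/* mounted at least once */
	}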

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Index: mmotm-2.6.30-May07/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.30-May07.orig/include/linux/memcontrol.h
+++ mmotm-2.6.30-May07/include/linux/memcontrol.h
@@ -115,6 +115,8 @@ static inline bool mem_cgroup_disabled(v
 		return true;
 	return false;
 }
+/* Returns true if mem_cgroup is enabled and really used (mounted). */
+bool mem_cgroup_activated(void);
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
 void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
@@ -229,6 +231,11 @@ static inline bool mem_cgroup_disabled(v
 	return true;
 }
 
+static inline bool mem_cgroup_activated(void)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_oom_called(struct task_struct *task)
 {
 	return false;
Index: mmotm-2.6.30-May07/mm/memcontrol.c
===================================================================
--- mmotm-2.6.30-May07.orig/mm/memcontrol.c
+++ mmotm-2.6.30-May07/mm/memcontrol.c
@@ -2577,6 +2577,17 @@ static void mem_cgroup_move_task(struct 
 	mutex_unlock(&memcg_tasklist);
 }
 
+static bool __mem_cgroup_activated = false;
+bool mem_cgroup_activated(void)
+{
+	return __mem_cgroup_activated;
+}
+
+static void mem_cgroup_bind(struct cgroup_subsys *ss, struct cgroup *root)
+{
+	__mem_cgroup_activated = true;
+}
+
 struct cgroup_subsys mem_cgroup_subsys = {
 	.name = "memory",
 	.subsys_id = mem_cgroup_subsys_id,
@@ -2585,6 +2596,7 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.destroy = mem_cgroup_destroy,
 	.populate = mem_cgroup_populate,
 	.attach = mem_cgroup_move_task,
+	.bind = mem_cgroup_bind,
 	.early_init = 0,
 	.use_id = 1,
 };


* [PATCH 2/3] fix swap cache account leak at swapin-readahead
  2009-05-12  1:44 ` KAMEZAWA Hiroyuki
@ 2009-05-12  1:46   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-12  1:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, akpm, mingo, linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In general, Linux's swp_entry handling is done by a combination of lazy techniques
and the global LRU. It works well, but when we use the mem+swap controller, some
stricter control is appropriate. Otherwise, a swp_entry used by a cgroup will
never be freed until the global LRU runs. On a system where memcg is well
configured, the global LRU doesn't run frequently.

  Example) Assume swapin-readahead.
	      CPU0			      CPU1
	   zap_pte()			  read_swap_cache_async()
					  swap_duplicate().
           swap_entry_free() = 1
	   find_get_page()=> NULL.
					  add_to_swap_cache().
					  issue swap I/O. 

There are many patterns of this kind of race (usually without problems).

free_swap_and_cache() is called to free a swp_entry, but it is a best-effort
function. If the swp_entry/page seems busy, the swp_entry is not freed.
This is not a problem because the global LRU will find the SwapCache at page reclaim.

If memcg is used, on the other hand, the global LRU may not run. Then, the above
unused SwapCache will not be freed.
(Unmapped SwapCache occupies a swp_entry but is never freed if it is not on memcg's LRU.)

So, even if there are no tasks in a cgroup, swp_entry usage still remains.
In the worst case, an OOM from the mem+swap controller is triggered by this
"leak" of swp_entries, as Nishimura reported.

Considering this issue, swapin-readahead itself is not very good for memcg.
It reads swap cache pages which will not be used (and _unused_ swapcache will
not be accounted). Even if we accounted swap cache at add_to_swap_cache(),
we would need to account pages to several _unrelated_ memcgs. This is bad.

This patch tries to fix the racy case between free_swap_and_cache() and page status.

After this patch is applied, the following test works well.

  # echo 1-2M > ../memory.limit_in_bytes
  # run tasks under memcg.
  # kill all tasks and make memory.tasks empty
  # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
    there is no _used_ swp_entry.

What this patch does is
 - avoid swapin-readahead when memcg is activated.
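
Condensed, the change amounts to the following (sketch only; the real hunk is
in the diff below). swapin_readahead() reads just the faulting entry once memcg
has really been mounted:

	int nr_pages = 1;		/* default: no readahead cluster */
	unsigned long offset = 0;

	if (!mem_cgroup_activated())
		nr_pages = valid_swaphandles(entry, &offset);
	/* with memcg mounted, only the requested swp_entry is read in */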

Changelog: v6 -> v7
 - just handle races in readahead.
 - races in writeback are handled in the next patch.

Changelog: v5 -> v6
 - works only when memcg is activated.
 - check after I/O works only after writeback.
 - avoid swapin-readahead when memcg is activated.
 - fixed page refcnt issue.
Changelog: v4->v5
 - completely new design.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/swap_state.c |   20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

Index: mmotm-2.6.30-May07/mm/swap_state.c
===================================================================
--- mmotm-2.6.30-May07.orig/mm/swap_state.c
+++ mmotm-2.6.30-May07/mm/swap_state.c
@@ -349,9 +349,9 @@ struct page *read_swap_cache_async(swp_e
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	int nr_pages;
+	int nr_pages = 1;
 	struct page *page;
-	unsigned long offset;
+	unsigned long offset = 0;
 	unsigned long end_offset;
 
 	/*
@@ -360,8 +360,22 @@ struct page *swapin_readahead(swp_entry_
 	 * No, it's very unlikely that swap layout would follow vma layout,
 	 * more likely that neighbouring swap pages came from the same node:
 	 * so use the same "addr" to choose the same node for each swap read.
+	 *
+	 * But, when memcg is used, swapin readahead gives us some bad
+	 * effects. There are two big problems in general.
+	 * 1. Swapin readahead tends to use/read _not required_ memory,
+	 *    and _not required_ memory is only freed by the global LRU.
+	 * 2. We can't charge pages for swap-cache readahead because
+	 *    we should avoid accounting memory to a cgroup which the
+	 *    thread calling this function is not related to.
+	 * Also, swapin-readahead has a racy condition with
+	 * free_swap_and_cache(). This also annoys memcg.
+	 * So, if memcg is really used, we avoid readahead.
 	 */
-	nr_pages = valid_swaphandles(entry, &offset);
+
+	if (!mem_cgroup_activated())
+		nr_pages = valid_swaphandles(entry, &offset);
+
 	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),


* [PATCH 3/3] fix stale swap cache at writeback.
  2009-05-12  1:44 ` KAMEZAWA Hiroyuki
@ 2009-05-12  1:47   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-12  1:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, akpm, mingo, linux-kernel

From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

memcg: free unused swapcache on swapout path

Reclaiming anonymous memory in vmscan.c is done in the following 2 steps:
  1. add to swap and unmap.
  2. pageout
But the above 2 steps don't occur at once. There are many chances
to avoid pageout, and _really_ unused pages are swapped out by the
visit-and-check-again logic of LRU rotation.
But this behavior causes trouble for memcg.

memcg cannot handle !PageCgroupUsed swapcache whose owner process
has already exited.

This patch is for handling such swap caches created by a race like below:

    Assume processA is exiting and a pte points to a page (!PageSwapCache),
    and processB is trying to reclaim that page.

              processA                   |           processB
    -------------------------------------+-------------------------------------
      (page_remove_rmap())               |  (shrink_page_list())
         mem_cgroup_uncharge_page()      |
            ->uncharged because it's not |
              PageSwapCache yet.         |
              So, both mem/memsw.usage   |
              are decremented.           |
                                         |    add_to_swap() -> added to swap cache.

    If this page gets through shrink_page_list() without being freed for some
    reason, it doesn't go back to memcg's LRU because of !PageCgroupUsed.

Such swap cache pages cannot be freed by memcg's LRU scanning, and as a result
their swp_entries cannot be freed properly.
This patch adds a hook after add_to_swap() to check whether the page is still
mapped by a process, and frees it if it has already been unmapped.
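
Condensed, the hook sits right after add_to_swap() in shrink_page_list()
(sketch only; the full hunk is in the diff below):

	if (!add_to_swap(page))
		goto activate_locked;
	/*
	 * The owner may already have unmapped and uncharged the page;
	 * drop the now-unused swap cache before doing the pageout.
	 */
	if (memcg_free_unused_swapcache(page))
		goto keep_locked;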

If a page was already on the swap cache when the owner process called
page_remove_rmap() -> mem_cgroup_uncharge_page(), the page is not uncharged.
It goes back to memcg's LRU even if it goes through shrink_page_list()
without being freed, so this patch ignores that case.

Changelog: from Nishimura's original one.
 - moved functions to vmscan.c

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
Index: mmotm-2.6.30-May07/mm/vmscan.c
===================================================================
--- mmotm-2.6.30-May07.orig/mm/vmscan.c
+++ mmotm-2.6.30-May07/mm/vmscan.c
@@ -586,6 +586,32 @@ void putback_lru_page(struct page *page)
 }
 #endif /* CONFIG_UNEVICTABLE_LRU */
 
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR) && defined(CONFIG_SWAP)
+
+static int memcg_free_unused_swapcache(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(!PageSwapCache(page));
+
+	if (mem_cgroup_disabled())
+		return 0;
+	/*
+	 * What we do here is checking whether the page is accounted by memcg.
+	 * page_mapped() is enough of a check for avoiding the race.
+	 */
+	if (!PageAnon(page) || page_mapped(page))
+		return 0;
+	return try_to_free_swap(page);	/* checks page_swapcount */
+}
+
+#else
+
+static int memcg_free_unused_swapcache(struct page *page)
+{
+	return 0;
+}
+
+#endif
 
 /*
  * shrink_page_list() returns the number of reclaimed pages
@@ -663,6 +689,14 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 			if (!add_to_swap(page))
 				goto activate_locked;
+			/*
+			 * The owner process might have uncharged the page
+			 * (by page_remove_rmap()) before it has been added
+			 * to swap cache.
+			 * Check it here to avoid making it stale.
+			 */
+			if (memcg_free_unused_swapcache(page))
+				goto keep_locked;
 			may_enter_fs = 1;
 		}
 


* Re: [PATCH 2/3] fix swap cache account leak at swapin-readahead
  2009-05-12  1:46   ` KAMEZAWA Hiroyuki
@ 2009-05-12  4:32     ` Daisuke Nishimura
  -1 siblings, 0 replies; 52+ messages in thread
From: Daisuke Nishimura @ 2009-05-12  4:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir, akpm, mingo, linux-kernel

On Tue, 12 May 2009 10:46:03 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> In general, Linux's swp_entry handling is done by combination of lazy techniques
> and global LRU. It works well but when we use mem+swap controller, some more
> strict control is appropriate. Otherwise, swp_entry used by a cgroup will be
> never freed until global LRU works. In a system where memcg is well-configured,
> global LRU doesn't work frequently.
> 
>   Example) Assume swapin-readahead.
> 	      CPU0			      CPU1
> 	   zap_pte()			  read_swap_cache_async()
> 					  swap_duplicate().
>            swap_entry_free() = 1
> 	   find_get_page()=> NULL.
> 					  add_to_swap_cache().
> 					  issue swap I/O. 
> 
> There are many patterns of this kind of race (but no problems).
> 
> free_swap_and_cache() is called for freeing swp_entry. But it is a best-effort
> function. If the swp_entry/page seems busy, swp_entry is not freed.
> This is not a problem because global-LRU will find SwapCache at page reclaim.
> 
> If memcg is used, on the other hand, global LRU may not work. Then, above
> unused SwapCache will not be freed.
> (unmapped SwapCache occupy swp_entry but never be freed if not on memcg's LRU)
> 
> So, even if there are no tasks in a cgroup, swp_entry usage still remains.
> In bad case, OOM by mem+swap controller is triggered by this "leak" of
> swp_entry as Nishimura reported.
> 
> Considering this issue, swapin-readahead itself is not very good for memcg.
> It read swap cache which will not be used. (and _unused_ swapcache will
> not be accounted.) Even if we account swap cache at add_to_swap_cache(),
> we need to account page to several _unrelated_ memcg. This is bad.
> 
> This patch tries to fix racy case of free_swap_and_cache() and page status.
> 
> After this patch applied, following test works well.
> 
>   # echo 1-2M > ../memory.limit_in_bytes
>   # run tasks under memcg.
>   # kill all tasks and make memory.tasks empty
>   # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
>     there is no _used_ swp_entry.
> 
> What this patch does is
>  - avoid swapin-readahead when memcg is activated.
> 
> Changelog: v6 -> v7
>  - just handle races in readahead.
>  - races in writeback is handled in the next patch.
> 
> Changelog: v5 -> v6
>  - works only when memcg is activated.
>  - check after I/O works only after writeback.
>  - avoid swapin-readahead when memcg is activated.
>  - fixed page refcnt issue.
> Changelog: v4->v5
>  - completely new design.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Looks good to me.

	Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>


* [PATCH 4/3] memcg: call uncharge_swapcache outside of tree_lock (Re: [PATCH 0/3] fix stale swap cache account leak  in memcg v7)
  2009-05-12  1:44 ` KAMEZAWA Hiroyuki
@ 2009-05-12  5:06   ` Daisuke Nishimura
  -1 siblings, 0 replies; 52+ messages in thread
From: Daisuke Nishimura @ 2009-05-12  5:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir, akpm, mingo, linux-kernel

On Tue, 12 May 2009 10:44:01 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> I hope this version gets acks..
> ==
> As Nishimura reported, there is a race at handling swap cache.
> 
> Typical cases are following (from Nishimura's mail)
> 
> 
> == Type-1 ==
>   If some pages of processA has been swapped out, it calls free_swap_and_cache().
>   And if at the same time, processB is calling read_swap_cache_async() about
>   a swap entry *that is used by processA*, a race like below can happen.
> 
>             processA                   |           processB
>   -------------------------------------+-------------------------------------
>     (free_swap_and_cache())            |  (read_swap_cache_async())
>                                        |    swap_duplicate()
>                                        |    __set_page_locked()
>                                        |    add_to_swap_cache()
>       swap_entry_free() == 0           |
>       find_get_page() -> found         |
>       try_lock_page() -> fail & return |
>                                        |    lru_cache_add_anon()
>                                        |      doesn't link this page to memcg's
>                                        |      LRU, because of !PageCgroupUsed.
> 
>   This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.
> 
> 
> == Type-2 ==
>     Assume processA is exiting and pte points to a page(!PageSwapCache).
>     And processB is trying reclaim the page.
> 
>               processA                   |           processB
>     -------------------------------------+-------------------------------------
>       (page_remove_rmap())               |  (shrink_page_list())
>          mem_cgroup_uncharge_page()      |
>             ->uncharged because it's not |
>               PageSwapCache yet.         |
>               So, both mem/memsw.usage   |
>               are decremented.           |
>                                          |    add_to_swap() -> added to swap cache.
> 
>     If this page goes thorough without being freed for some reason, this page
>     doesn't goes back to memcg's LRU because of !PageCgroupUsed.
> 
> 
> Considering Type-1, it's better to avoid swapin-readahead when memcg is used.
> swapin-readahead just read swp_entries which are near to requested entry. So,
> pages not to be used can be on memory (on global LRU). When memcg is used,
> this is not good behavior anyway.
> 
> Considering Type-2, the page should be freed from SwapCache right after WriteBack.
> Free swapped out pages as soon as possible is a good nature to memcg, anyway.
> 
> The patch set includes followng
>  [1/3] add mem_cgroup_is_activated() function. which tell us memcg is _really_ used.
>  [2/3] fix swap cache handling race by avoidng readahead.
>  [3/3] fix swap cache handling race by check swapcount again.
> 
> Result is good under my test.
> 
These patches seem to work well on my side too.


BTW, we need one more fix, which I found in a long-term test last week.
With this patch applied, my test survived all through the weekend.

I don't know why we've never hit this bug so far.
I think I hit it because my memcg_free_unused_swapcache() patch increases
the likelihood of calling mem_cgroup_uncharge_swapcache().

Thanks,
Daisuke Nishimura.
===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

memcg: call mem_cgroup_uncharge_swapcache() outside of tree_lock

It's rare, but I hit a spinlock lockup.

    BUG: spinlock lockup on CPU#2, page01/24205, ffffffff806faf18
    Pid: 24205, comm: page01 Not tainted 2.6.30-rc4-5845621d #1
    Call Trace:
     <IRQ>  [<ffffffff80294f93>] ? test_clear_page_writeback+0x4d/0xff
     [<ffffffff803752bc>] ? _raw_spin_lock+0xfb/0x122
     [<ffffffff804ee9ba>] ? _spin_lock_irqsave+0x59/0x70
     [<ffffffff80294f93>] ? test_clear_page_writeback+0x4d/0xff
     [<ffffffff8029649c>] ? rotate_reclaimable_page+0x87/0x8e
     [<ffffffff80294f93>] ? test_clear_page_writeback+0x4d/0xff
     [<ffffffff8028d4ed>] ? end_page_writeback+0x1c/0x3d
     [<ffffffff802ac4c6>] ? end_swap_bio_write+0x57/0x65
     [<ffffffff803516d9>] ? __end_that_request_first+0x1f3/0x2e4
     [<ffffffff803515f2>] ? __end_that_request_first+0x10c/0x2e4
     [<ffffffff803517e5>] ? end_that_request_data+0x1b/0x4c
     [<ffffffff8035225a>] ? blk_end_io+0x1c/0x76
     [<ffffffffa0060702>] ? scsi_io_completion+0x1dc/0x467 [scsi_mod]
     [<ffffffff80356394>] ? blk_done_softirq+0x66/0x76
     [<ffffffff8024002e>] ? __do_softirq+0xae/0x182
     [<ffffffff8020cb3c>] ? call_softirq+0x1c/0x2a
     [<ffffffff8020de9a>] ? do_softirq+0x31/0x83
     [<ffffffff8020d57b>] ? do_IRQ+0xa9/0xbf
     [<ffffffff8020c393>] ? ret_from_intr+0x0/0xf
     <EOI>  [<ffffffff8026fd56>] ? res_counter_uncharge+0x67/0x70
     [<ffffffff802bf04a>] ? __mem_cgroup_uncharge_common+0xbd/0x158
     [<ffffffff802a0f55>] ? unmap_vmas+0x7ef/0x829
     [<ffffffff802a8a3a>] ? page_remove_rmap+0x1b/0x36
     [<ffffffff802a0c11>] ? unmap_vmas+0x4ab/0x829
     [<ffffffff802a5243>] ? exit_mmap+0xa7/0x11c
     [<ffffffff80239009>] ? mmput+0x41/0x9f
     [<ffffffff8023cf7b>] ? exit_mm+0x101/0x10c
     [<ffffffff8023e481>] ? do_exit+0x1a4/0x61e
     [<ffffffff80259391>] ? trace_hardirqs_on_caller+0x11d/0x148
     [<ffffffff8023e96e>] ? do_group_exit+0x73/0xa5
     [<ffffffff8023e9b2>] ? sys_exit_group+0x12/0x16
     [<ffffffff8020b96b>] ? system_call_fastpath+0x16/0x1b

This is caused when:

CPU1: __mem_cgroup_uncharge_common(), which is holding the page_cgroup lock,
      is interrupted, and end_swap_bio_write(), which tries to take
      swapper_space.tree_lock, is called in the interrupt context.
CPU2: mem_cgroup_uncharge_swapcache(), which is called under swapper_space.tree_lock,
      is spinning at lock_page_cgroup() in __mem_cgroup_uncharge_common().

IIUC, there is no need for mem_cgroup_uncharge_swapcache() to be called under
swapper_space.tree_lock, so move it outside the tree_lock.
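
Condensed, the resulting ordering in delete_from_swap_cache() looks like this
(sketch mirroring the diff below):

	spin_lock_irq(&swapper_space.tree_lock);
	__delete_from_swap_cache(page);
	spin_unlock_irq(&swapper_space.tree_lock);

	mem_cgroup_uncharge_swapcache(page, entry);	/* now outside tree_lock */
	swap_free(entry);
	page_cache_release(page);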

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/swap.h |    5 +++++
 mm/memcontrol.c      |    2 +-
 mm/swap_state.c      |    4 +---
 mm/vmscan.c          |    1 +
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index caf0767..6ea541d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -431,6 +431,11 @@ static inline swp_entry_t get_swap_page(void)
 #define has_swap_token(x) 0
 #define disable_swap_token() do { } while(0)
 
+static inline void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+}
+
 #endif /* CONFIG_SWAP */
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c9c1ad..379f200 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1489,7 +1489,7 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
 }
 
 /*
- * called from __delete_from_swap_cache() and drop "page" account.
+ * called after __delete_from_swap_cache() and drop "page" account.
  * memcg information is recorded to swap_cgroup of "ent"
  */
 void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e389ef2..7624c89 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -109,8 +109,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
  */
 void __delete_from_swap_cache(struct page *page)
 {
-	swp_entry_t ent = {.val = page_private(page)};
-
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageSwapCache(page));
 	VM_BUG_ON(PageWriteback(page));
@@ -121,7 +119,6 @@ void __delete_from_swap_cache(struct page *page)
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
-	mem_cgroup_uncharge_swapcache(page, ent);
 }
 
 /**
@@ -191,6 +188,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&swapper_space.tree_lock);
 
+	mem_cgroup_uncharge_swapcache(page, entry);
 	swap_free(entry);
 	page_cache_release(page);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 337be66..e674cd1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -470,6 +470,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_uncharge_swapcache(page, swap);
 		swap_free(swap);
 	} else {
 		__remove_from_page_cache(page);

* Re: [PATCH 4/3] memcg: call uncharge_swapcache outside of tree_lock (Re: [PATCH 0/3] fix stale swap cache account leak  in memcg v7)
  2009-05-12  5:06   ` Daisuke Nishimura
@ 2009-05-12  7:09     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-12  7:09 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir, akpm, mingo, linux-kernel

On Tue, 12 May 2009 14:06:48 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Tue, 12 May 2009 10:44:01 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I hope this version gets acks..
> > ==
> > As Nishimura reported, there is a race at handling swap cache.
> > 
> > Typical cases are following (from Nishimura's mail)
> > 
> > 
> > == Type-1 ==
> >   If some pages of processA has been swapped out, it calls free_swap_and_cache().
> >   And if at the same time, processB is calling read_swap_cache_async() about
> >   a swap entry *that is used by processA*, a race like below can happen.
> > 
> >             processA                   |           processB
> >   -------------------------------------+-------------------------------------
> >     (free_swap_and_cache())            |  (read_swap_cache_async())
> >                                        |    swap_duplicate()
> >                                        |    __set_page_locked()
> >                                        |    add_to_swap_cache()
> >       swap_entry_free() == 0           |
> >       find_get_page() -> found         |
> >       try_lock_page() -> fail & return |
> >                                        |    lru_cache_add_anon()
> >                                        |      doesn't link this page to memcg's
> >                                        |      LRU, because of !PageCgroupUsed.
> > 
> >   This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.
> > 
> > 
> > == Type-2 ==
> >     Assume processA is exiting and pte points to a page(!PageSwapCache).
> >     And processB is trying reclaim the page.
> > 
> >               processA                   |           processB
> >     -------------------------------------+-------------------------------------
> >       (page_remove_rmap())               |  (shrink_page_list())
> >          mem_cgroup_uncharge_page()      |
> >             ->uncharged because it's not |
> >               PageSwapCache yet.         |
> >               So, both mem/memsw.usage   |
> >               are decremented.           |
> >                                          |    add_to_swap() -> added to swap cache.
> > 
> >     If this page goes thorough without being freed for some reason, this page
> >     doesn't goes back to memcg's LRU because of !PageCgroupUsed.
> > 
> > 
> > Considering Type-1, it's better to avoid swapin-readahead when memcg is used.
> > swapin-readahead just read swp_entries which are near to requested entry. So,
> > pages not to be used can be on memory (on global LRU). When memcg is used,
> > this is not good behavior anyway.
> > 
> > Considering Type-2, the page should be freed from SwapCache right after WriteBack.
> > Free swapped out pages as soon as possible is a good nature to memcg, anyway.
> > 
> > The patch set includes followng
> >  [1/3] add mem_cgroup_is_activated() function. which tell us memcg is _really_ used.
> >  [2/3] fix swap cache handling race by avoidng readahead.
> >  [3/3] fix swap cache handling race by check swapcount again.
> > 
> > Result is good under my test.
> > 
> These patches seem to work well on my side too.
> 
> 
> BTW, we need one more fix which I found in a long term test last week.
> After this patch, it survived all through the weekend in my test.
> 
> I don't know why we've never hit this bug so far.
> I think I hit it because my memcg_free_unused_swapcache() patch increases
> the possibility of calling mem_cgroup_uncharge_swapcache().
> 
> Thanks,
> Daisuke Nishimura.
> ===
> From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> memcg: call mem_cgroup_uncharge_swapcache() outside of tree_lock
> 
> It's rare, but I hit a spinlock lockup.
> 
>     BUG: spinlock lockup on CPU#2, page01/24205, ffffffff806faf18
>     Pid: 24205, comm: page01 Not tainted 2.6.30-rc4-5845621d #1
>     Call Trace:
>      <IRQ>  [<ffffffff80294f93>] ? test_clear_page_writeback+0x4d/0xff
>      [<ffffffff803752bc>] ? _raw_spin_lock+0xfb/0x122
>      [<ffffffff804ee9ba>] ? _spin_lock_irqsave+0x59/0x70
>      [<ffffffff80294f93>] ? test_clear_page_writeback+0x4d/0xff
>      [<ffffffff8029649c>] ? rotate_reclaimable_page+0x87/0x8e
>      [<ffffffff80294f93>] ? test_clear_page_writeback+0x4d/0xff
>      [<ffffffff8028d4ed>] ? end_page_writeback+0x1c/0x3d
>      [<ffffffff802ac4c6>] ? end_swap_bio_write+0x57/0x65
>      [<ffffffff803516d9>] ? __end_that_request_first+0x1f3/0x2e4
>      [<ffffffff803515f2>] ? __end_that_request_first+0x10c/0x2e4
>      [<ffffffff803517e5>] ? end_that_request_data+0x1b/0x4c
>      [<ffffffff8035225a>] ? blk_end_io+0x1c/0x76
>      [<ffffffffa0060702>] ? scsi_io_completion+0x1dc/0x467 [scsi_mod]
>      [<ffffffff80356394>] ? blk_done_softirq+0x66/0x76
>      [<ffffffff8024002e>] ? __do_softirq+0xae/0x182
>      [<ffffffff8020cb3c>] ? call_softirq+0x1c/0x2a
>      [<ffffffff8020de9a>] ? do_softirq+0x31/0x83
>      [<ffffffff8020d57b>] ? do_IRQ+0xa9/0xbf
>      [<ffffffff8020c393>] ? ret_from_intr+0x0/0xf
>      <EOI>  [<ffffffff8026fd56>] ? res_counter_uncharge+0x67/0x70
>      [<ffffffff802bf04a>] ? __mem_cgroup_uncharge_common+0xbd/0x158
>      [<ffffffff802a0f55>] ? unmap_vmas+0x7ef/0x829
>      [<ffffffff802a8a3a>] ? page_remove_rmap+0x1b/0x36
>      [<ffffffff802a0c11>] ? unmap_vmas+0x4ab/0x829
>      [<ffffffff802a5243>] ? exit_mmap+0xa7/0x11c
>      [<ffffffff80239009>] ? mmput+0x41/0x9f
>      [<ffffffff8023cf7b>] ? exit_mm+0x101/0x10c
>      [<ffffffff8023e481>] ? do_exit+0x1a4/0x61e
>      [<ffffffff80259391>] ? trace_hardirqs_on_caller+0x11d/0x148
>      [<ffffffff8023e96e>] ? do_group_exit+0x73/0xa5
>      [<ffffffff8023e9b2>] ? sys_exit_group+0x12/0x16
>      [<ffffffff8020b96b>] ? system_call_fastpath+0x16/0x1b
> 
> This is caused when:
> 
> CPU1: __mem_cgroup_uncharge_common(), which is holding page_cgroup lock,
>       is interuppted and end_swap_bio_write(), which tries to hold
>       swapper_space.tree_lock, is called in the interrupt context.
> CPU2: mem_cgroup_uncharge_swapcache(), which is called under swapper_space.tree_lock,
>       is spinning at lock_page_cgroup() in __mem_cgroup_uncharge_common().
> 
> IIUC, there is no need that mem_cgroup_uncharge_swapcache() should be called under
> swapper_space.tree.lock, so move it outside the tree_lock.
> 

I understand the problem, but please wait a bit. NACK to this patch itself.

1. I placed the _uncharge_ inside tree_lock because __remove_from_page_cache()
   does the same (i.e. it uses the same logic).
   So, please change both at once (i.e. also change the caller of
   mem_cgroup_uncharge_cache_page()).

2. Shouldn't we disable IRQs around __mem_cgroup_uncharge_common() rather than
   move the function?  A rough sketch of this option follows below.
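
For concreteness, a minimal sketch of option 2.  The function name below is
hypothetical and only marks where the IRQ disabling would sit; the real
lookup_page_cgroup()/lock_page_cgroup() helpers are assumed:

/* Hypothetical illustration only -- not an existing kernel function. */
static void mem_cgroup_uncharge_page_irqsafe(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);
	unsigned long flags;

	local_irq_save(flags);		/* keeps end_swap_bio_write() etc. out */
	lock_page_cgroup(pc);
	/* ... the existing __mem_cgroup_uncharge_common() body ... */
	unlock_page_cgroup(pc);
	local_irq_restore(flags);
}

Whether the extra IRQ-off section is acceptable on the common uncharge path is
the trade-off to be weighed against simply moving the hook.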

Thanks,
-Kame





> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/swap.h |    5 +++++
>  mm/memcontrol.c      |    2 +-
>  mm/swap_state.c      |    4 +---
>  mm/vmscan.c          |    1 +
>  4 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index caf0767..6ea541d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -431,6 +431,11 @@ static inline swp_entry_t get_swap_page(void)
>  #define has_swap_token(x) 0
>  #define disable_swap_token() do { } while(0)
>  
> +static inline void
> +mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> +{
> +}
> +
>  #endif /* CONFIG_SWAP */
>  #endif /* __KERNEL__*/
>  #endif /* _LINUX_SWAP_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0c9c1ad..379f200 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1489,7 +1489,7 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
>  }
>  
>  /*
> - * called from __delete_from_swap_cache() and drop "page" account.
> + * called after __delete_from_swap_cache() and drop "page" account.
>   * memcg information is recorded to swap_cgroup of "ent"
>   */
>  void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e389ef2..7624c89 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -109,8 +109,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
>   */
>  void __delete_from_swap_cache(struct page *page)
>  {
> -	swp_entry_t ent = {.val = page_private(page)};
> -
>  	VM_BUG_ON(!PageLocked(page));
>  	VM_BUG_ON(!PageSwapCache(page));
>  	VM_BUG_ON(PageWriteback(page));
> @@ -121,7 +119,6 @@ void __delete_from_swap_cache(struct page *page)
>  	total_swapcache_pages--;
>  	__dec_zone_page_state(page, NR_FILE_PAGES);
>  	INC_CACHE_INFO(del_total);
> -	mem_cgroup_uncharge_swapcache(page, ent);
>  }
>  
>  /**
> @@ -191,6 +188,7 @@ void delete_from_swap_cache(struct page *page)
>  	__delete_from_swap_cache(page);
>  	spin_unlock_irq(&swapper_space.tree_lock);
>  
> +	mem_cgroup_uncharge_swapcache(page, entry);
>  	swap_free(entry);
>  	page_cache_release(page);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 337be66..e674cd1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -470,6 +470,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
>  		swp_entry_t swap = { .val = page_private(page) };
>  		__delete_from_swap_cache(page);
>  		spin_unlock_irq(&mapping->tree_lock);
> +		mem_cgroup_uncharge_swapcache(page, swap);
>  		swap_free(swap);
>  	} else {
>  		__remove_from_page_cache(page);
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/3] memcg: call uncharge_swapcache outside of tree_lock (Re: [PATCH 0/3] fix stale swap cache account leak  in memcg v7)
  2009-05-12  7:09     ` KAMEZAWA Hiroyuki
@ 2009-05-12  8:00       ` Daisuke Nishimura
  -1 siblings, 0 replies; 52+ messages in thread
From: Daisuke Nishimura @ 2009-05-12  8:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir, akpm, mingo, linux-kernel

> I understand the problem, but, wait a bit. NACK to this patch itself.
> 
> 1. I placed _uncharge_ inside tree_lock because __remove_from_page_cache() does.
>    (i.e. using the same logic.)
>    So, plz change both logic at once.(change caller of  mem_cgroup_uncharge_cache_page())
> 
Hmm, I see.
cache_charge is done outside of tree_lock, so moving the uncharge out as well
would make sense.
IMHO, we should hold the spinlock for as short a period as possible, and
charge/uncharge of page cache/swap cache is protected by the page lock, not by
tree_lock.
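
In other words, the ordering being argued for is roughly the following (a
sketch only; remove_and_uncharge() is a made-up name, and callers of
remove_from_page_cache() already hold the page lock):

	/* Sketch only -- not actual kernel code. */
	static void remove_and_uncharge(struct address_space *mapping, struct page *page)
	{
		BUG_ON(!PageLocked(page));		/* page lock serializes charge vs. uncharge */
		spin_lock_irq(&mapping->tree_lock);	/* covers only the radix-tree update */
		__remove_from_page_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		mem_cgroup_uncharge_cache_page(page);	/* memcg hook outside tree_lock */
	}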

> 2. Shouldn't we disable IRQ while __mem_cgroup_uncharge_common() rather than moving
>    function ?
> 
Yes, that is another choice.
But wouldn't it be better to disable IRQs at all users of
lock_page_cgroup()..unlock_page_cgroup() to avoid this deadlock?

Anyway, I'll postpone this fix for a while.
We should fix the stale swap cache first.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH][BUGFIX] memcg: fix for deadlock between lock_page_cgroup and mapping tree_lock
  2009-05-12  8:00       ` Daisuke Nishimura
@ 2009-05-12  8:13         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-12  8:13 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir, akpm, mingo, linux-kernel

On Tue, 12 May 2009 17:00:07 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> hmm, I see.
> cache_charge is outside of tree_lock, so moving uncharge would make sense.
> IMHO, we should make the period of spinlock as small as possible,
> and charge/uncharge of pagecache/swapcache is protected by page lock, not tree_lock.
> 
How about this ?
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

As Nishimura pointed out, mapping->tree_lock can be acquired from interrupt
context, so the following deadlock can occur.
Assume "A" is a page:

 CPU0:
       lock_page_cgroup(A)
		interrupted
			-> take mapping->tree_lock.
 CPU1:
       take mapping->tree_lock
		-> lock_page_cgroup(A)

This patch tries to fix the above deadlock by moving memcg's hooks out of
mapping->tree_lock.

After this patch, lock_page_cgroup() is not called under mapping->tree_lock.

This makes Nishimura's first fix more fundamental and avoids adding a special case.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/filemap.c    |    6 +++---
 mm/swap_state.c |    2 +-
 mm/truncate.c   |    1 +
 mm/vmscan.c     |    2 ++
 4 files changed, 7 insertions(+), 4 deletions(-)

Index: mmotm-2.6.30-May07/mm/filemap.c
===================================================================
--- mmotm-2.6.30-May07.orig/mm/filemap.c
+++ mmotm-2.6.30-May07/mm/filemap.c
@@ -121,7 +121,6 @@ void __remove_from_page_cache(struct pag
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	BUG_ON(page_mapped(page));
-	mem_cgroup_uncharge_cache_page(page);
 
 	/*
 	 * Some filesystems seem to re-dirty the page even after
@@ -145,6 +144,7 @@ void remove_from_page_cache(struct page 
 	spin_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_uncharge_cache_page(page);
 }
 
 static int sync_page(void *word)
@@ -476,13 +476,13 @@ int add_to_page_cache_locked(struct page
 		if (likely(!error)) {
 			mapping->nrpages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
+			spin_unlock_irq(&mapping->tree_lock);
 		} else {
 			page->mapping = NULL;
+			spin_unlock_irq(&mapping->tree_lock);
 			mem_cgroup_uncharge_cache_page(page);
 			page_cache_release(page);
 		}
-
-		spin_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();
 	} else
 		mem_cgroup_uncharge_cache_page(page);
Index: mmotm-2.6.30-May07/mm/swap_state.c
===================================================================
--- mmotm-2.6.30-May07.orig/mm/swap_state.c
+++ mmotm-2.6.30-May07/mm/swap_state.c
@@ -121,7 +121,6 @@ void __delete_from_swap_cache(struct pag
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
-	mem_cgroup_uncharge_swapcache(page, ent);
 }
 
 /**
@@ -191,6 +190,7 @@ void delete_from_swap_cache(struct page 
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&swapper_space.tree_lock);
 
+	mem_cgroup_uncharge_swapcache(page, ent);
 	swap_free(entry);
 	page_cache_release(page);
 }
Index: mmotm-2.6.30-May07/mm/truncate.c
===================================================================
--- mmotm-2.6.30-May07.orig/mm/truncate.c
+++ mmotm-2.6.30-May07/mm/truncate.c
@@ -359,6 +359,7 @@ invalidate_complete_page2(struct address
 	BUG_ON(page_has_private(page));
 	__remove_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_uncharge_cache_page(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
 failed:
Index: mmotm-2.6.30-May07/mm/vmscan.c
===================================================================
--- mmotm-2.6.30-May07.orig/mm/vmscan.c
+++ mmotm-2.6.30-May07/mm/vmscan.c
@@ -477,10 +477,12 @@ static int __remove_mapping(struct addre
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_uncharge_swapcache(page);
 		swap_free(swap);
 	} else {
 		__remove_from_page_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_uncharge_cache_page(page);
 	}
 
 	return 1;


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/3] fix stale swap cache account leak  in memcg v7
  2009-05-12  1:44 ` KAMEZAWA Hiroyuki
@ 2009-05-12  9:51   ` Balbir Singh
  -1 siblings, 0 replies; 52+ messages in thread
From: Balbir Singh @ 2009-05-12  9:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, akpm, mingo, linux-kernel

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-12 10:44:01]:

> I hope this version gets acks..
> ==
> As Nishimura reported, there is a race at handling swap cache.
> 
> Typical cases are following (from Nishimura's mail)
> 
> 
> == Type-1 ==
>   If some pages of processA has been swapped out, it calls free_swap_and_cache().
>   And if at the same time, processB is calling read_swap_cache_async() about
>   a swap entry *that is used by processA*, a race like below can happen.
> 
>             processA                   |           processB
>   -------------------------------------+-------------------------------------
>     (free_swap_and_cache())            |  (read_swap_cache_async())
>                                        |    swap_duplicate()
>                                        |    __set_page_locked()
>                                        |    add_to_swap_cache()
>       swap_entry_free() == 0           |
>       find_get_page() -> found         |
>       try_lock_page() -> fail & return |
>                                        |    lru_cache_add_anon()
>                                        |      doesn't link this page to memcg's
>                                        |      LRU, because of !PageCgroupUsed.
> 
>   This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.
> 
> 
> == Type-2 ==
>     Assume processA is exiting and pte points to a page(!PageSwapCache).
>     And processB is trying reclaim the page.
> 
>               processA                   |           processB
>     -------------------------------------+-------------------------------------
>       (page_remove_rmap())               |  (shrink_page_list())
>          mem_cgroup_uncharge_page()      |
>             ->uncharged because it's not |
>               PageSwapCache yet.         |
>               So, both mem/memsw.usage   |
>               are decremented.           |
>                                          |    add_to_swap() -> added to swap cache.
> 
>     If this page goes thorough without being freed for some reason, this page
>     doesn't goes back to memcg's LRU because of !PageCgroupUsed.
> 
> 
> Considering Type-1, it's better to avoid swapin-readahead when memcg is used.
> swapin-readahead just read swp_entries which are near to requested entry. So,
> pages not to be used can be on memory (on global LRU). When memcg is used,
> this is not good behavior anyway.
> 
> Considering Type-2, the page should be freed from SwapCache right after WriteBack.
> Free swapped out pages as soon as possible is a good nature to memcg, anyway.
> 
> The patch set includes followng
>  [1/3] add mem_cgroup_is_activated() function. which tell us memcg is _really_ used.
>  [2/3] fix swap cache handling race by avoidng readahead.
>  [3/3] fix swap cache handling race by check swapcount again.
> 
> Result is good under my test.

What was the result (performance data/impact) of disabling swap readahead?
Otherwise, this looks like the most reasonable set of patches for this
problem.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH][BUGFIX] memcg: fix for deadlock between lock_page_cgroup and mapping tree_lock
  2009-05-12  8:13         ` KAMEZAWA Hiroyuki
@ 2009-05-12 10:58           ` Daisuke Nishimura
  -1 siblings, 0 replies; 52+ messages in thread
From: Daisuke Nishimura @ 2009-05-12 10:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: d-nishimura, Daisuke Nishimura, linux-mm, balbir, akpm, mingo,
	linux-kernel

On Tue, 12 May 2009 17:13:56 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 12 May 2009 17:00:07 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > hmm, I see.
> > cache_charge is outside of tree_lock, so moving uncharge would make sense.
> > IMHO, we should make the period of spinlock as small as possible,
> > and charge/uncharge of pagecache/swapcache is protected by page lock, not tree_lock.
> > 
> How about this ?
Looks good conceptually, but it cannot be built :)

It needs a fix like the one below.
It passed build tests with CONFIG_MEM_RES_CTLR and CONFIG_SWAP both enabled
and both disabled.

===
 include/linux/swap.h |    5 +++++
 mm/memcontrol.c      |    4 +++-
 mm/swap_state.c      |    4 +---
 mm/vmscan.c          |    2 +-
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index caf0767..6ea541d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -431,6 +431,11 @@ static inline swp_entry_t get_swap_page(void)
 #define has_swap_token(x) 0
 #define disable_swap_token() do { } while(0)
 
+static inline void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+}
+
 #endif /* CONFIG_SWAP */
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c9c1ad..89523cf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1488,8 +1488,9 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
+#ifdef CONFIG_SWAP
 /*
- * called from __delete_from_swap_cache() and drop "page" account.
+ * called after __delete_from_swap_cache() and drop "page" account.
  * memcg information is recorded to swap_cgroup of "ent"
  */
 void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
@@ -1506,6 +1507,7 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
 	if (memcg)
 		css_put(&memcg->css);
 }
+#endif
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 87f10d4..7624c89 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -109,8 +109,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
  */
 void __delete_from_swap_cache(struct page *page)
 {
-	swp_entry_t ent = {.val = page_private(page)};
-
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageSwapCache(page));
 	VM_BUG_ON(PageWriteback(page));
@@ -190,7 +188,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&swapper_space.tree_lock);
 
-	mem_cgroup_uncharge_swapcache(page, ent);
+	mem_cgroup_uncharge_swapcache(page, entry);
 	swap_free(entry);
 	page_cache_release(page);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6c5988d..a7d7a06 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -470,7 +470,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		mem_cgroup_uncharge_swapcache(page);
+		mem_cgroup_uncharge_swapcache(page, swap);
 		swap_free(swap);
 	} else {
 		__remove_from_page_cache(page);
===

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] fix swap cache account leak at swapin-readahead
  2009-05-12  1:46   ` KAMEZAWA Hiroyuki
@ 2009-05-12 11:24     ` Johannes Weiner
  -1 siblings, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2009-05-12 11:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, akpm, mingo, linux-kernel

On Tue, May 12, 2009 at 10:46:03AM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> In general, Linux's swp_entry handling is done by combination of lazy techniques
> and global LRU. It works well but when we use mem+swap controller, some more
> strict control is appropriate. Otherwise, swp_entry used by a cgroup will be
> never freed until global LRU works. In a system where memcg is well-configured,
> global LRU doesn't work frequently.
> 
>   Example) Assume swapin-readahead.
> 	      CPU0			      CPU1
> 	   zap_pte()			  read_swap_cache_async()
> 					  swap_duplicate().
>            swap_entry_free() = 1
> 	   find_get_page()=> NULL.
> 					  add_to_swap_cache().
> 					  issue swap I/O. 
> 
> There are many patterns of this kind of race (but no problems).
> 
> free_swap_and_cache() is called for freeing swp_entry. But it is a best-effort
> function. If the swp_entry/page seems busy, swp_entry is not freed.
> This is not a problem because global-LRU will find SwapCache at page reclaim.
> 
> If memcg is used, on the other hand, global LRU may not work. Then, above
> unused SwapCache will not be freed.
> (unmapped SwapCache occupy swp_entry but never be freed if not on memcg's LRU)
> 
> So, even if there are no tasks in a cgroup, swp_entry usage still remains.
> In bad case, OOM by mem+swap controller is triggered by this "leak" of
> swp_entry as Nishimura reported.
> 
> Considering this issue, swapin-readahead itself is not very good for memcg.
> It read swap cache which will not be used. (and _unused_ swapcache will
> not be accounted.) Even if we account swap cache at add_to_swap_cache(),
> we need to account page to several _unrelated_ memcg. This is bad.
> 
> This patch tries to fix racy case of free_swap_and_cache() and page status.
> 
> After this patch applied, following test works well.
> 
>   # echo 1-2M > ../memory.limit_in_bytes
>   # run tasks under memcg.
>   # kill all tasks and make memory.tasks empty
>   # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
>     there is no _used_ swp_entry.
> 
> What this patch does is
>  - avoid swapin-readahead when memcg is activated.
> 
> Changelog: v6 -> v7
>  - just handle races in readahead.
>  - races in writeback is handled in the next patch.
> 
> Changelog: v5 -> v6
>  - works only when memcg is activated.
>  - check after I/O works only after writeback.
>  - avoid swapin-readahead when memcg is activated.
>  - fixed page refcnt issue.
> Changelog: v4->v5
>  - completely new design.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/swap_state.c |   20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> Index: mmotm-2.6.30-May07/mm/swap_state.c
> ===================================================================
> --- mmotm-2.6.30-May07.orig/mm/swap_state.c
> +++ mmotm-2.6.30-May07/mm/swap_state.c
> @@ -349,9 +349,9 @@ struct page *read_swap_cache_async(swp_e
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
> -	int nr_pages;
> +	int nr_pages = 1;
>  	struct page *page;
> -	unsigned long offset;
> +	unsigned long offset = 0;
>  	unsigned long end_offset;
>  
>  	/*
> @@ -360,8 +360,22 @@ struct page *swapin_readahead(swp_entry_
>  	 * No, it's very unlikely that swap layout would follow vma layout,
>  	 * more likely that neighbouring swap pages came from the same node:
>  	 * so use the same "addr" to choose the same node for each swap read.
> +	 *
> +	 * But, when memcg is used, swapin readahead give us some bad
> +	 * effects. There are 2 big problems in general.
> +	 * 1. Swapin readahead tend to use/read _not required_ memory.
> +	 *    And _not required_ memory is only freed by global LRU.
> +	 * 2. We can't charge pages for swap-cache readahead because
> +	 *    we should avoid account memory in a cgroup which a
> +	 *    thread call this function is not related to.
> +	 * And swapin-readahead have racy condition with
> +	 * free_swap_and_cache(). This also annoys memcg.
> +	 * Then, if memcg is really used, we avoid readahead.
>  	 */
> -	nr_pages = valid_swaphandles(entry, &offset);
> +
> +	if (!mem_cgroup_activated())
> +		nr_pages = valid_swaphandles(entry, &offset);
> +
>  	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),

Having nr_pages set to 1 and offset set to zero will actually enter that
loop and try to read a swap slot at offset zero, including a superfluous
page allocation, just to fail at swap_duplicate() (swap slot 0 is the swap
header -> SWAP_MAP_BAD).

How about:

	if (mem_cgroup_activated())
		goto pivot;
	nr_pages = valid_swaphandles(...);
	for (readahead loop)
		...
pivot:
	return read_swap_cache_async();

That will also save you the runtime initialization of nr_pages and offset
completely when the cgroup is active.  And you'll have only one branch, with
no second one for offset < end_offset in the loop.  And the lru draining too,
though I'm not sure about that one; I think it's not needed.
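
Fleshed out against the swapin_readahead() in the hunk above, that suggestion
would look roughly like this (a sketch only: mem_cgroup_activated() is the
helper this series proposes, and the rest just mirrors the existing loop):

struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
			struct vm_area_struct *vma, unsigned long addr)
{
	int nr_pages;
	struct page *page;
	unsigned long offset;
	unsigned long end_offset;

	/* Skip readahead entirely when memcg is really in use. */
	if (mem_cgroup_activated())
		goto pivot;

	nr_pages = valid_swaphandles(entry, &offset);
	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
		/* Ok, do the async read-ahead now */
		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
						gfp_mask, vma, addr);
		if (!page)
			break;
		page_cache_release(page);
	}
	lru_add_drain();	/* Push any new pages onto the LRU now */
pivot:
	return read_swap_cache_async(entry, gfp_mask, vma, addr);
}

Whether lru_add_drain() should also be skipped in the memcg path is the open
question mentioned above.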

	Hannes

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] fix swap cache account leak at swapin-readahead
  2009-05-12 11:24     ` Johannes Weiner
@ 2009-05-12 23:58       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-12 23:58 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, balbir, nishimura, akpm, mingo, linux-kernel

On Tue, 12 May 2009 13:24:00 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, May 12, 2009 at 10:46:03AM +0900, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > In general, Linux's swp_entry handling is done by combination of lazy techniques
> > and global LRU. It works well but when we use mem+swap controller, some more
> > strict control is appropriate. Otherwise, swp_entry used by a cgroup will be
> > never freed until global LRU works. In a system where memcg is well-configured,
> > global LRU doesn't work frequently.
> > 
> >   Example) Assume swapin-readahead.
> > 	      CPU0			      CPU1
> > 	   zap_pte()			  read_swap_cache_async()
> > 					  swap_duplicate().
> >            swap_entry_free() = 1
> > 	   find_get_page()=> NULL.
> > 					  add_to_swap_cache().
> > 					  issue swap I/O. 
> > 
> > There are many patterns of this kind of race (but no problems).
> > 
> > free_swap_and_cache() is called for freeing swp_entry. But it is a best-effort
> > function. If the swp_entry/page seems busy, swp_entry is not freed.
> > This is not a problem because global-LRU will find SwapCache at page reclaim.
> > 
> > If memcg is used, on the other hand, global LRU may not work. Then, above
> > unused SwapCache will not be freed.
> > (unmapped SwapCache occupy swp_entry but never be freed if not on memcg's LRU)
> > 
> > So, even if there are no tasks in a cgroup, swp_entry usage still remains.
> > In bad case, OOM by mem+swap controller is triggered by this "leak" of
> > swp_entry as Nishimura reported.
> > 
> > Considering this issue, swapin-readahead itself is not very good for memcg.
> > It read swap cache which will not be used. (and _unused_ swapcache will
> > not be accounted.) Even if we account swap cache at add_to_swap_cache(),
> > we need to account page to several _unrelated_ memcg. This is bad.
> > 
> > This patch tries to fix racy case of free_swap_and_cache() and page status.
> > 
> > After this patch applied, following test works well.
> > 
> >   # echo 1-2M > ../memory.limit_in_bytes
> >   # run tasks under memcg.
> >   # kill all tasks and make memory.tasks empty
> >   # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
> >     there is no _used_ swp_entry.
> > 
> > What this patch does is
> >  - avoid swapin-readahead when memcg is activated.
> > 
> > Changelog: v6 -> v7
> >  - just handle races in readahead.
> >  - races in writeback is handled in the next patch.
> > 
> > Changelog: v5 -> v6
> >  - works only when memcg is activated.
> >  - check after I/O works only after writeback.
> >  - avoid swapin-readahead when memcg is activated.
> >  - fixed page refcnt issue.
> > Changelog: v4->v5
> >  - completely new design.
> > 
> > Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/swap_state.c |   20 +++++++++++++++++---
> >  1 file changed, 17 insertions(+), 3 deletions(-)
> > 
> > Index: mmotm-2.6.30-May07/mm/swap_state.c
> > ===================================================================
> > --- mmotm-2.6.30-May07.orig/mm/swap_state.c
> > +++ mmotm-2.6.30-May07/mm/swap_state.c
> > @@ -349,9 +349,9 @@ struct page *read_swap_cache_async(swp_e
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >  			struct vm_area_struct *vma, unsigned long addr)
> >  {
> > -	int nr_pages;
> > +	int nr_pages = 1;
> >  	struct page *page;
> > -	unsigned long offset;
> > +	unsigned long offset = 0;
> >  	unsigned long end_offset;
> >  
> >  	/*
> > @@ -360,8 +360,22 @@ struct page *swapin_readahead(swp_entry_
> >  	 * No, it's very unlikely that swap layout would follow vma layout,
> >  	 * more likely that neighbouring swap pages came from the same node:
> >  	 * so use the same "addr" to choose the same node for each swap read.
> > +	 *
> > +	 * But, when memcg is used, swapin readahead gives us some bad
> > +	 * effects. There are 2 big problems in general.
> > +	 * 1. Swapin readahead tends to use/read _not required_ memory,
> > +	 *    and _not required_ memory is only freed by the global LRU.
> > +	 * 2. We can't charge pages for swap-cache readahead because
> > +	 *    we should avoid accounting memory to a cgroup which the
> > +	 *    thread calling this function is not related to.
> > +	 * Swapin readahead also has a racy condition with
> > +	 * free_swap_and_cache(). This also annoys memcg.
> > +	 * So, if memcg is really used, we avoid readahead.
> >  	 */
> > -	nr_pages = valid_swaphandles(entry, &offset);
> > +
> > +	if (!mem_cgroup_activated())
> > +		nr_pages = valid_swaphandles(entry, &offset);
> > +
> >  	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
> >  		/* Ok, do the async read-ahead now */
> >  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> 
> Having nr_pages set to 1 and offset set to zero will actually enter that
> loop and try to read the swap slot at offset zero, including a
> superfluous page allocation, only to fail at swap_duplicate()
> (swap slot 0 is the swap header -> SWAP_MAP_BAD).
> 
Hmm?
Can swp_entry(swp_type(entry), offset) really end up with offset == 0 here?

> How about:
> 
> 	if (mem_cgroup_activated())
> 		goto pivot;
> 	nr_pages = valid_swaphandles(...);
> 	for (readahead loop)
> 		...
> pivot:
> 	return read_swap_cache_async();
> 
> That will also save you the runtime initialization of nr_pages and
> offset completely when the cgroup is active.  And you'll have only one
> branch, with no second check of offset < end_offset in the loop.  The
> same goes for the lru draining, but I'm not sure about that one; I
> think it's not needed.
> 
Hmm. I'm not sure why the lru_add_drain() -> read_swap_cache_async() sequence
is done before returning to the caller. Doesn't the page that is returned
need to be on the LRU?

Thanks,
-Kame



> 	Hannes
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH][BUGFIX] memcg: fix for deadlock between lock_page_cgroup and mapping tree_lock
  2009-05-12 10:58           ` Daisuke Nishimura
@ 2009-05-12 23:59             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-12 23:59 UTC (permalink / raw)
  To: nishimura; +Cc: Daisuke Nishimura, linux-mm, balbir, akpm, mingo, linux-kernel

On Tue, 12 May 2009 19:58:23 +0900
Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> wrote:

> On Tue, 12 May 2009 17:13:56 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Tue, 12 May 2009 17:00:07 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > hmm, I see.
> > > cache_charge is outside of tree_lock, so moving uncharge would make sense.
> > > IMHO, we should make the period of spinlock as small as possible,
> > > and charge/uncharge of pagecache/swapcache is protected by page lock, not tree_lock.
> > > 
> > How about this ?
> Looks good conceptually, but it cannot be built :)
> 
> It needs a fix like this.
> Passed build test with enabling/disabling both CONFIG_MEM_RES_CTLR
> and CONFIG_SWAP.
> 
OK, I'll update it. Can I add your Signed-off-by to the patch?

Thanks,
-Kame
> ===
>  include/linux/swap.h |    5 +++++
>  mm/memcontrol.c      |    4 +++-
>  mm/swap_state.c      |    4 +---
>  mm/vmscan.c          |    2 +-
>  4 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index caf0767..6ea541d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -431,6 +431,11 @@ static inline swp_entry_t get_swap_page(void)
>  #define has_swap_token(x) 0
>  #define disable_swap_token() do { } while(0)
>  
> +static inline void
> +mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> +{
> +}
> +
>  #endif /* CONFIG_SWAP */
>  #endif /* __KERNEL__*/
>  #endif /* _LINUX_SWAP_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0c9c1ad..89523cf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1488,8 +1488,9 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
>  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
>  }
>  
> +#ifdef CONFIG_SWAP
>  /*
> - * called from __delete_from_swap_cache() and drop "page" account.
> + * called after __delete_from_swap_cache() and drop "page" account.
>   * memcg information is recorded to swap_cgroup of "ent"
>   */
>  void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> @@ -1506,6 +1507,7 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
>  	if (memcg)
>  		css_put(&memcg->css);
>  }
> +#endif
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>  /*
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 87f10d4..7624c89 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -109,8 +109,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
>   */
>  void __delete_from_swap_cache(struct page *page)
>  {
> -	swp_entry_t ent = {.val = page_private(page)};
> -
>  	VM_BUG_ON(!PageLocked(page));
>  	VM_BUG_ON(!PageSwapCache(page));
>  	VM_BUG_ON(PageWriteback(page));
> @@ -190,7 +188,7 @@ void delete_from_swap_cache(struct page *page)
>  	__delete_from_swap_cache(page);
>  	spin_unlock_irq(&swapper_space.tree_lock);
>  
> -	mem_cgroup_uncharge_swapcache(page, ent);
> +	mem_cgroup_uncharge_swapcache(page, entry);
>  	swap_free(entry);
>  	page_cache_release(page);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6c5988d..a7d7a06 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -470,7 +470,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
>  		swp_entry_t swap = { .val = page_private(page) };
>  		__delete_from_swap_cache(page);
>  		spin_unlock_irq(&mapping->tree_lock);
> -		mem_cgroup_uncharge_swapcache(page);
> +		mem_cgroup_uncharge_swapcache(page, swap);
>  		swap_free(swap);
>  	} else {
>  		__remove_from_page_cache(page);
> ===
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH][BUGFIX] memcg: fix for deadlock between lock_page_cgroup and mapping tree_lock
  2009-05-12 23:59             ` KAMEZAWA Hiroyuki
@ 2009-05-13  0:28               ` Daisuke Nishimura
  -1 siblings, 0 replies; 52+ messages in thread
From: Daisuke Nishimura @ 2009-05-13  0:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, balbir, akpm, mingo,
	linux-kernel

On Wed, 13 May 2009 08:59:49 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 12 May 2009 19:58:23 +0900
> Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> wrote:
> 
> > On Tue, 12 May 2009 17:13:56 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Tue, 12 May 2009 17:00:07 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > hmm, I see.
> > > > cache_charge is outside of tree_lock, so moving uncharge would make sense.
> > > > IMHO, we should make the period of spinlock as small as possible,
> > > > and charge/uncharge of pagecache/swapcache is protected by page lock, not tree_lock.
> > > > 
> > > How about this ?
> > Looks good conceptually, but it cannot be built :)
> > 
> > It needs a fix like this.
> > Passed build test with enabling/disabling both CONFIG_MEM_RES_CTLR
> > and CONFIG_SWAP.
> > 
> OK, I'll update it. Can I add your Signed-off-by to the patch?
> 
Sure.

	Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

The patch (with my fix applied) seems to work fine, though I need to run it
for a longer time.
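
Just to spell out the net effect as I read the combined diffs, the
resulting delete_from_swap_cache() should be roughly the following (a
simplified, untested sketch, not the actual patch; the __remove_mapping()
hunk in vmscan.c follows the same unlock-then-uncharge pattern):

void delete_from_swap_cache(struct page *page)
{
	swp_entry_t entry = { .val = page_private(page) };

	spin_lock_irq(&swapper_space.tree_lock);
	__delete_from_swap_cache(page);		/* no memcg uncharge under tree_lock */
	spin_unlock_irq(&swapper_space.tree_lock);

	/* uncharge with only the page lock held, outside the irq-disabled lock */
	mem_cgroup_uncharge_swapcache(page, entry);
	swap_free(entry);
	page_cache_release(page);
}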

> Thanks,
> -Kame
> > ===
> >  include/linux/swap.h |    5 +++++
> >  mm/memcontrol.c      |    4 +++-
> >  mm/swap_state.c      |    4 +---
> >  mm/vmscan.c          |    2 +-
> >  4 files changed, 10 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index caf0767..6ea541d 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -431,6 +431,11 @@ static inline swp_entry_t get_swap_page(void)
> >  #define has_swap_token(x) 0
> >  #define disable_swap_token() do { } while(0)
> >  
> > +static inline void
> > +mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> > +{
> > +}
> > +
> >  #endif /* CONFIG_SWAP */
> >  #endif /* __KERNEL__*/
> >  #endif /* _LINUX_SWAP_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 0c9c1ad..89523cf 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1488,8 +1488,9 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
> >  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> >  }
> >  
> > +#ifdef CONFIG_SWAP
> >  /*
> > - * called from __delete_from_swap_cache() and drop "page" account.
> > + * called after __delete_from_swap_cache() and drop "page" account.
> >   * memcg information is recorded to swap_cgroup of "ent"
> >   */
> >  void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> > @@ -1506,6 +1507,7 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> >  	if (memcg)
> >  		css_put(&memcg->css);
> >  }
> > +#endif
> >  
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> >  /*
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 87f10d4..7624c89 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -109,8 +109,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
> >   */
> >  void __delete_from_swap_cache(struct page *page)
> >  {
> > -	swp_entry_t ent = {.val = page_private(page)};
> > -
> >  	VM_BUG_ON(!PageLocked(page));
> >  	VM_BUG_ON(!PageSwapCache(page));
> >  	VM_BUG_ON(PageWriteback(page));
> > @@ -190,7 +188,7 @@ void delete_from_swap_cache(struct page *page)
> >  	__delete_from_swap_cache(page);
> >  	spin_unlock_irq(&swapper_space.tree_lock);
> >  
> > -	mem_cgroup_uncharge_swapcache(page, ent);
> > +	mem_cgroup_uncharge_swapcache(page, entry);
> >  	swap_free(entry);
> >  	page_cache_release(page);
> >  }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6c5988d..a7d7a06 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -470,7 +470,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
> >  		swp_entry_t swap = { .val = page_private(page) };
> >  		__delete_from_swap_cache(page);
> >  		spin_unlock_irq(&mapping->tree_lock);
> > -		mem_cgroup_uncharge_swapcache(page);
> > +		mem_cgroup_uncharge_swapcache(page, swap);
> >  		swap_free(swap);
> >  	} else {
> >  		__remove_from_page_cache(page);
> > ===
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/3] fix stale swap cache account leak  in memcg v7
  2009-05-12  9:51   ` Balbir Singh
@ 2009-05-13  0:31     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-13  0:31 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, nishimura, akpm, mingo, linux-kernel

On Tue, 12 May 2009 15:21:58 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > The patch set includes followng
> >  [1/3] add mem_cgroup_is_activated() function. which tell us memcg is _really_ used.
> >  [2/3] fix swap cache handling race by avoidng readahead.
> >  [3/3] fix swap cache handling race by check swapcount again.
> > 
> > Result is good under my test.
> 
> What was the result (performance data impact) of disabling swap
> readahead? Otherwise, this looks the most reasonable set of patches
> for this problem.
> 
I'll take some measurements and report them in the next post.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH][BUGFIX] memcg: fix for deadlock between lock_page_cgroup and mapping tree_lock
  2009-05-13  0:28               ` Daisuke Nishimura
@ 2009-05-13  0:32                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-13  0:32 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Daisuke Nishimura, linux-mm, balbir, akpm, mingo, linux-kernel

On Wed, 13 May 2009 09:28:28 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 13 May 2009 08:59:49 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 12 May 2009 19:58:23 +0900
> > Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> wrote:
> > 
> > > On Tue, 12 May 2009 17:13:56 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Tue, 12 May 2009 17:00:07 +0900
> > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > > hmm, I see.
> > > > > cache_charge is outside of tree_lock, so moving uncharge would make sense.
> > > > > IMHO, we should make the period of spinlock as small as possible,
> > > > > and charge/uncharge of pagecache/swapcache is protected by page lock, not tree_lock.
> > > > > 
> > > > How about this ?
> > > Looks good conceptually, but it cannot be built :)
> > > 
> > > It needs a fix like this.
> > > Passed build test with enabling/disabling both CONFIG_MEM_RES_CTLR
> > > and CONFIG_SWAP.
> > > 
> > OK, I'll update it. Can I add your Signed-off-by to the patch?
> > 
> Sure.
> 
> 	Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> The patch (with my fix applied) seems to work fine, though I need to run it
> for a longer time.
> 
Ok, I'll treat this as an independent issue, not as "4/4".

Thanks,
-Kame


> > Thanks,
> > -Kame
> > > ===
> > >  include/linux/swap.h |    5 +++++
> > >  mm/memcontrol.c      |    4 +++-
> > >  mm/swap_state.c      |    4 +---
> > >  mm/vmscan.c          |    2 +-
> > >  4 files changed, 10 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index caf0767..6ea541d 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -431,6 +431,11 @@ static inline swp_entry_t get_swap_page(void)
> > >  #define has_swap_token(x) 0
> > >  #define disable_swap_token() do { } while(0)
> > >  
> > > +static inline void
> > > +mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> > > +{
> > > +}
> > > +
> > >  #endif /* CONFIG_SWAP */
> > >  #endif /* __KERNEL__*/
> > >  #endif /* _LINUX_SWAP_H */
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 0c9c1ad..89523cf 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1488,8 +1488,9 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
> > >  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> > >  }
> > >  
> > > +#ifdef CONFIG_SWAP
> > >  /*
> > > - * called from __delete_from_swap_cache() and drop "page" account.
> > > + * called after __delete_from_swap_cache() and drop "page" account.
> > >   * memcg information is recorded to swap_cgroup of "ent"
> > >   */
> > >  void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> > > @@ -1506,6 +1507,7 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> > >  	if (memcg)
> > >  		css_put(&memcg->css);
> > >  }
> > > +#endif
> > >  
> > >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > >  /*
> > > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > > index 87f10d4..7624c89 100644
> > > --- a/mm/swap_state.c
> > > +++ b/mm/swap_state.c
> > > @@ -109,8 +109,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
> > >   */
> > >  void __delete_from_swap_cache(struct page *page)
> > >  {
> > > -	swp_entry_t ent = {.val = page_private(page)};
> > > -
> > >  	VM_BUG_ON(!PageLocked(page));
> > >  	VM_BUG_ON(!PageSwapCache(page));
> > >  	VM_BUG_ON(PageWriteback(page));
> > > @@ -190,7 +188,7 @@ void delete_from_swap_cache(struct page *page)
> > >  	__delete_from_swap_cache(page);
> > >  	spin_unlock_irq(&swapper_space.tree_lock);
> > >  
> > > -	mem_cgroup_uncharge_swapcache(page, ent);
> > > +	mem_cgroup_uncharge_swapcache(page, entry);
> > >  	swap_free(entry);
> > >  	page_cache_release(page);
> > >  }
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 6c5988d..a7d7a06 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -470,7 +470,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
> > >  		swp_entry_t swap = { .val = page_private(page) };
> > >  		__delete_from_swap_cache(page);
> > >  		spin_unlock_irq(&mapping->tree_lock);
> > > -		mem_cgroup_uncharge_swapcache(page);
> > > +		mem_cgroup_uncharge_swapcache(page, swap);
> > >  		swap_free(swap);
> > >  	} else {
> > >  		__remove_from_page_cache(page);
> > > ===
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at  http://www.tux.org/lkml/
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH][BUGFIX] memcg: fix for deadlock between lock_page_cgroup and mapping tree_lock
  2009-05-13  0:32                 ` KAMEZAWA Hiroyuki
@ 2009-05-13  3:55                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-13  3:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Daisuke Nishimura, linux-mm, balbir, akpm,
	mingo, linux-kernel

On Wed, 13 May 2009 09:32:50 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 13 May 2009 09:28:28 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Wed, 13 May 2009 08:59:49 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Tue, 12 May 2009 19:58:23 +0900
> > > Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> wrote:
> > > 
> > > > On Tue, 12 May 2009 17:13:56 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 
> > > > > On Tue, 12 May 2009 17:00:07 +0900
> > > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > > > hmm, I see.
> > > > > > cache_charge is outside of tree_lock, so moving uncharge would make sense.
> > > > > > IMHO, we should make the period of spinlock as small as possible,
> > > > > > and charge/uncharge of pagecache/swapcache is protected by page lock, not tree_lock.
> > > > > > 
> > > > > How about this ?
> > > > Looks good conceptually, but it cannot be built :)
> > > > 
> > > > It needs a fix like this.
> > > > Passed build test with enabling/disabling both CONFIG_MEM_RES_CTLR
> > > > and CONFIG_SWAP.
> > > > 
> > > OK, I'll update it. Can I add your Signed-off-by to the patch?
> > > 
> > Sure.
> > 
> > 	Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > 
> > The patch (with my fix applied) seems to work fine, though I need to run it
> > for a longer time.
> > 
> Ok, I'll treat this as an independent issue, not as "4/4".
> 
Could you merge mine and yours and send it to Andrew?
I think this is a fix for a deadlock and its priority is very high.
I found my system broken and am reinstalling the whole system now,
so I can't post a patch today.


> Thanks,
> -Kame
> 
> 
> > > Thanks,
> > > -Kame
> > > > ===
> > > >  include/linux/swap.h |    5 +++++
> > > >  mm/memcontrol.c      |    4 +++-
> > > >  mm/swap_state.c      |    4 +---
> > > >  mm/vmscan.c          |    2 +-
> > > >  4 files changed, 10 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index caf0767..6ea541d 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -431,6 +431,11 @@ static inline swp_entry_t get_swap_page(void)
> > > >  #define has_swap_token(x) 0
> > > >  #define disable_swap_token() do { } while(0)
> > > >  
> > > > +static inline void
> > > > +mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> > > > +{
> > > > +}
> > > > +
> > > >  #endif /* CONFIG_SWAP */
> > > >  #endif /* __KERNEL__*/
> > > >  #endif /* _LINUX_SWAP_H */
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 0c9c1ad..89523cf 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1488,8 +1488,9 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
> > > >  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> > > >  }
> > > >  
> > > > +#ifdef CONFIG_SWAP
> > > >  /*
> > > > - * called from __delete_from_swap_cache() and drop "page" account.
> > > > + * called after __delete_from_swap_cache() and drop "page" account.
> > > >   * memcg information is recorded to swap_cgroup of "ent"
> > > >   */
> > > >  void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> > > > @@ -1506,6 +1507,7 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> > > >  	if (memcg)
> > > >  		css_put(&memcg->css);
> > > >  }
> > > > +#endif
> > > >  
> > > >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > > >  /*
> > > > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > > > index 87f10d4..7624c89 100644
> > > > --- a/mm/swap_state.c
> > > > +++ b/mm/swap_state.c
> > > > @@ -109,8 +109,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
> > > >   */
> > > >  void __delete_from_swap_cache(struct page *page)
> > > >  {
> > > > -	swp_entry_t ent = {.val = page_private(page)};
> > > > -
> > > >  	VM_BUG_ON(!PageLocked(page));
> > > >  	VM_BUG_ON(!PageSwapCache(page));
> > > >  	VM_BUG_ON(PageWriteback(page));
> > > > @@ -190,7 +188,7 @@ void delete_from_swap_cache(struct page *page)
> > > >  	__delete_from_swap_cache(page);
> > > >  	spin_unlock_irq(&swapper_space.tree_lock);
> > > >  
> > > > -	mem_cgroup_uncharge_swapcache(page, ent);
> > > > +	mem_cgroup_uncharge_swapcache(page, entry);
> > > >  	swap_free(entry);
> > > >  	page_cache_release(page);
> > > >  }
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 6c5988d..a7d7a06 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -470,7 +470,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
> > > >  		swp_entry_t swap = { .val = page_private(page) };
> > > >  		__delete_from_swap_cache(page);
> > > >  		spin_unlock_irq(&mapping->tree_lock);
> > > > -		mem_cgroup_uncharge_swapcache(page);
> > > > +		mem_cgroup_uncharge_swapcache(page, swap);
> > > >  		swap_free(swap);
> > > >  	} else {
> > > >  		__remove_from_page_cache(page);
> > > > ===
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > Please read the FAQ at  http://www.tux.org/lkml/
> > > > 
> > > 
> > 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH][BUGFIX] memcg: fix for deadlock between lock_page_cgroup and mapping tree_lock
  2009-05-13  3:55                   ` KAMEZAWA Hiroyuki
@ 2009-05-13  4:11                     ` nishimura
  -1 siblings, 0 replies; 52+ messages in thread
From: nishimura @ 2009-05-13  4:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Daisuke Nishimura, linux-mm, balbir, akpm,
	mingo, linux-kernel

> On Wed, 13 May 2009 09:32:50 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
>> On Wed, 13 May 2009 09:28:28 +0900
>> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>> 
>> > On Wed, 13 May 2009 08:59:49 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > > On Tue, 12 May 2009 19:58:23 +0900
>> > > Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> wrote:
>> > > 
>> > > > On Tue, 12 May 2009 17:13:56 +0900
>> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > > > 
>> > > > > On Tue, 12 May 2009 17:00:07 +0900
>> > > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>> > > > > > hmm, I see.
>> > > > > > cache_charge is outside of tree_lock, so moving uncharge would make sense.
>> > > > > > IMHO, we should make the period of spinlock as small as possible,
>> > > > > > and charge/uncharge of pagecache/swapcache is protected by page lock, not tree_lock.
>> > > > > > 
>> > > > > How about this ?
>> > > > Looks good conceptually, but it cannot be built :)
>> > > > 
>> > > > It needs a fix like this.
>> > > > Passed build test with enabling/disabling both CONFIG_MEM_RES_CTLR
>> > > > and CONFIG_SWAP.
>> > > > 
>> > > ok, will update. Can I add your Signed-off-by to the patch?
>> > > 
>> > Sure.
>> > 
>> > 	Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>> > 
>> > The patch (with my fix applied) seems to work fine, though I need to run it
>> > for a longer time.
>> > 
>> Ok, I'll treat this as an independent issue, not as "4/4".
>> 
> Could you merge mine and yours and send it to Andrew?
> I think this is a fix for a deadlock and the priority is very high.
> I found my system broken and am reinstalling the whole system now.
> So, I can't post the patch today.
> 
Okay.

I'll post it soon.

Thanks,
Daisuke Nishimura.
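
For reference, the ordering being settled on in this sub-thread is the one shown in the hunks quoted near the top of this page: the memcg uncharge (which takes lock_page_cgroup()) is issued only after the mapping's tree_lock has been dropped, and the page lock is what keeps the swap cache page stable across it. Below is a minimal sketch of delete_from_swap_cache() in that shape, reassembled from the quoted hunk rather than copied from the final merged patch:

==
/*
 * Sketch only -- the point is the lock ordering: never call the memcg
 * uncharge (which takes lock_page_cgroup()) while swapper_space.tree_lock
 * is held.  The caller holds the page lock.
 */
void delete_from_swap_cache(struct page *page)
{
	swp_entry_t entry;

	entry.val = page_private(page);

	spin_lock_irq(&swapper_space.tree_lock);
	__delete_from_swap_cache(page);	/* no memcg uncharge in here any more */
	spin_unlock_irq(&swapper_space.tree_lock);

	mem_cgroup_uncharge_swapcache(page, entry);	/* after tree_lock is dropped */
	swap_free(entry);
	page_cache_release(page);
}
==

The same ordering is applied to the PageSwapCache branch of __remove_mapping() in mm/vmscan.c, as the second quoted hunk shows.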


* Re: [PATCH 2/3] fix swap cache account leak at swapin-readahead
  2009-05-12 23:58       ` KAMEZAWA Hiroyuki
@ 2009-05-13 11:18         ` Johannes Weiner
  -1 siblings, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2009-05-13 11:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, akpm, mingo, Hugh Dickins, linux-kernel

On Wed, May 13, 2009 at 08:58:16AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 12 May 2009 13:24:00 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Tue, May 12, 2009 at 10:46:03AM +0900, KAMEZAWA Hiroyuki wrote:
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > >
> > > Index: mmotm-2.6.30-May07/mm/swap_state.c
> > > ===================================================================
> > > --- mmotm-2.6.30-May07.orig/mm/swap_state.c
> > > +++ mmotm-2.6.30-May07/mm/swap_state.c
> > > @@ -349,9 +349,9 @@ struct page *read_swap_cache_async(swp_e
> > >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > >  			struct vm_area_struct *vma, unsigned long addr)
> > >  {
> > > -	int nr_pages;
> > > +	int nr_pages = 1;
> > >  	struct page *page;
> > > -	unsigned long offset;
> > > +	unsigned long offset = 0;
> > >  	unsigned long end_offset;
> > >  
> > >  	/*
> > > @@ -360,8 +360,22 @@ struct page *swapin_readahead(swp_entry_
> > >  	 * No, it's very unlikely that swap layout would follow vma layout,
> > >  	 * more likely that neighbouring swap pages came from the same node:
> > >  	 * so use the same "addr" to choose the same node for each swap read.
> > > +	 *
> > > +	 * But, when memcg is used, swapin readahead give us some bad
> > > +	 * effects. There are 2 big problems in general.
> > > +	 * 1. Swapin readahead tend to use/read _not required_ memory.
> > > +	 *    And _not required_ memory is only freed by global LRU.
> > > +	 * 2. We can't charge pages for swap-cache readahead because
> > > +	 *    we should avoid account memory in a cgroup which a
> > > +	 *    thread call this function is not related to.
> > > +	 * And swapin-readahead have racy condition with
> > > +	 * free_swap_and_cache(). This also annoys memcg.
> > > +	 * Then, if memcg is really used, we avoid readahead.
> > >  	 */
> > > -	nr_pages = valid_swaphandles(entry, &offset);
> > > +
> > > +	if (!mem_cgroup_activated())
> > > +		nr_pages = valid_swaphandles(entry, &offset);
> > > +
> > >  	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
> > >  		/* Ok, do the async read-ahead now */
> > >  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> > 
> > Having nr_pages set to 1 and offset to zero will actually enter that
> > loop and try to read a swap slot at offset zero, including a
> > superfluous page allocation, just to fail at the swap_duplicate()
> > (swap slot 0 is swap header -> SWAP_MAP_BAD).
> > 
> Hmm ?
>  swp_entry(swp_type(entry), offset),
> can be zero ?

I'm not sure I understand your question.  Whether this whole
expression can or can not be zero is irrelevant.  My point is that you
enter the readahead loop with a bogus offset, while your original
intention is to completely disable readahead.

> > How about:
> > 
> > 	if (mem_cgroup_activated())
> > 		goto pivot;
> > 	nr_pages = valid_swaphandles(...);
> > 	for (readahead loop)
> > 		...
> > pivot:
> > 	return read_swap_cache_async();
> > 
> > That will also save you the runtime initialization of nr_pages and
> > offset completely when the cgroup is active.  And you'll have only one
> > branch and no second one for offset < end_offset in the loop.  And the
> > lru draining, but I'm not sure about that.  I think it's not needed.
> > 
> Hmm. I'm not sure why the lru_add_drain() -> read_swap_cache_async() sequence is done before returning
> to the caller. Doesn't the page to be returned need to be on the LRU?

I'm not sure either.  Neither the fault handler nor concurrent
swap-ins seem to care.  I added Hugh on CC.

	Hannes


* Re: [PATCH 2/3] fix swap cache account leak at swapin-readahead
  2009-05-13 11:18         ` Johannes Weiner
@ 2009-05-13 18:03           ` Hugh Dickins
  -1 siblings, 0 replies; 52+ messages in thread
From: Hugh Dickins @ 2009-05-13 18:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, linux-mm, balbir, nishimura, akpm, mingo,
	linux-kernel

On Wed, 13 May 2009, Johannes Weiner wrote:
> On Wed, May 13, 2009 at 08:58:16AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 12 May 2009 13:24:00 +0200
> > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Tue, May 12, 2009 at 10:46:03AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > >
> > > > Index: mmotm-2.6.30-May07/mm/swap_state.c
> > > > ===================================================================
> > > > --- mmotm-2.6.30-May07.orig/mm/swap_state.c
> > > > +++ mmotm-2.6.30-May07/mm/swap_state.c
> > > > @@ -349,9 +349,9 @@ struct page *read_swap_cache_async(swp_e
> > > >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > > >  			struct vm_area_struct *vma, unsigned long addr)
> > > >  {
> > > > -	int nr_pages;
> > > > +	int nr_pages = 1;
> > > >  	struct page *page;
> > > > -	unsigned long offset;
> > > > +	unsigned long offset = 0;
> > > >  	unsigned long end_offset;
> > > >  
> > > >  	/*
> > > > @@ -360,8 +360,22 @@ struct page *swapin_readahead(swp_entry_
> > > >  	 * No, it's very unlikely that swap layout would follow vma layout,
> > > >  	 * more likely that neighbouring swap pages came from the same node:
> > > >  	 * so use the same "addr" to choose the same node for each swap read.
> > > > +	 *
> > > > +	 * But, when memcg is used, swapin readahead give us some bad
> > > > +	 * effects. There are 2 big problems in general.
> > > > +	 * 1. Swapin readahead tend to use/read _not required_ memory.
> > > > +	 *    And _not required_ memory is only freed by global LRU.
> > > > +	 * 2. We can't charge pages for swap-cache readahead because
> > > > +	 *    we should avoid account memory in a cgroup which a
> > > > +	 *    thread call this function is not related to.
> > > > +	 * And swapin-readahead have racy condition with
> > > > +	 * free_swap_and_cache(). This also annoys memcg.
> > > > +	 * Then, if memcg is really used, we avoid readahead.
> > > >  	 */
> > > > -	nr_pages = valid_swaphandles(entry, &offset);
> > > > +
> > > > +	if (!mem_cgroup_activated())
> > > > +		nr_pages = valid_swaphandles(entry, &offset);
> > > > +
> > > >  	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
> > > >  		/* Ok, do the async read-ahead now */
> > > >  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> > > 
> > > Having nr_pages set to 1 and offset to zero will actually enter hat
> > > loop and try to read a swap slot at offset zero, including a
> > > superfluous page allocation, just to fail at the swap_duplicate()
> > > (swap slot 0 is swap header -> SWAP_MAP_BAD).
> > > 
> > Hmm ?
> >  swp_entry(swp_type(entry), offset),
> > can be zero ?
> 
> I'm not sure I understand your question.

Nor I, but I think KAMEZAWA-san is suggesting that we never come here
with offset 0 anyway.  Which I believe is correct.

(And in passing, off topic, note that we have a problem if we ever
do need to read page 0 in this way, in the swap-to-regular-file case: 
because the swap_extents reading of page 0 can differ from sys_swapon's
reading of the header page without swap_extents - possibly hibernate
to swapfile can suffer from that, but not regular swapping paths.)

> Whether this whole
> expression can or can not be zero is irrelevant.  My point is that you
> enter the readahead loop with a bogus offset, while your original
> intention is to completey disable readahead.

I don't really buy your point on offset 0 in particular: if offset 0
is asked for, it goes through the intended motions, though you and I
know that offset 0 will subsequently be found unsuitable; but it does
what is asked of it.

However, I do agree with you that it's silly to be entering this loop
at all when avoiding readahead.  When doing readahead, we have to cope
with the fact that any of the calls in the loop might have failed, so
we do the extra, targetted read_swap_cache_async at the end, to satisfy
the actual request.  When avoiding readahead, better just to go to that
final read_swap_cache_async, instead of duplicating it and compensating
with a page_cache_release too.

Which is what initializing nr_pages = 0 should achieve: see how
valid_swaphandles() returns 0 rather than 1 when avoiding readahead,
precisely to avoid the unnecessary duplication.  So I'd recommend
nr_pages = 0 rather than nr_pages = 1 at the top.
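
In that shape, and assuming mem_cgroup_activated() is the check added by patch 1/3 of this series (the cover letter calls it mem_cgroup_is_activated(), so the final name may differ), swapin_readahead() would look roughly like the sketch below; the lru_add_drain() and the final targetted read_swap_cache_async() are the existing tail of the function, not shown in the hunk quoted above:

==
struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
			struct vm_area_struct *vma, unsigned long addr)
{
	int nr_pages = 0;
	struct page *page;
	unsigned long offset = 0;
	unsigned long end_offset;

	/*
	 * When memcg is really in use, skip readahead entirely: with
	 * nr_pages and offset left at 0 the loop below never runs and
	 * we fall through to the single targetted read at the end.
	 */
	if (!mem_cgroup_activated())
		nr_pages = valid_swaphandles(entry, &offset);

	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
		/* Ok, do the async read-ahead now */
		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
						gfp_mask, vma, addr);
		if (!page)
			break;
		page_cache_release(page);
	}
	lru_add_drain();	/* push any readahead pages onto the LRU */
	/* the targetted read that satisfies the actual fault */
	return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
==
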

> 
> > > How about:
> > > 
> > > 	if (mem_cgroup_activated())
> > > 		goto pivot;
> > > 	nr_pages = valid_swaphandles(...);
> > > 	for (readahead loop)
> > > 		...
> > > pivot:
> > > 	return read_swap_cache_async();
> > > 
> > > That will also save you the runtime initialization of nr_pages and
> > > offset completely when the cgroup is active.  And you'll have only one
> > > branch and no second one for offset < end_offset in the loop.  And the
> > > lru draining, but I'm not sure about that.  I think it's not needed.
> > > 
> > Hmm. I'm not sure why lru_add_drain()->read_swap_cache_async() is inserted before returing
> > to caller. Is the page to be returned isn't necessary to be on LRU ?
> 
> I'm not sure either.  Neither the fault handler nor concurrent
> swap-ins seem to care.  I added Hugh on CC.

Thanks, though you've probably got me from git-blame identifying when
I moved that code around: the person you really want is akpm, then
@digeo.com, in ChangeLog-2.5.46:

	[PATCH] empty the deferred lru-addition buffers in swapin_readahead
	
	If we're about to return to userspace after performing some swap
	readahead, the pages in the deferred-addition LRU queues could stay
	there for some time.  So drain them after performing readahead.

I suspect that's a "seems like a good idea, especially if we've many cpus"
(which I do agree with), rather than a practical finding in some workload.
If we've read in a number of pages which quite possibly will prove of no
use to anyone, better have them visible to reclaim on the LRU as soon as
possible, rather than stuck indefinitely in per-cpu vectors.

The non-readahead case is a little different, in that you know the one
page is really of use to someone; so it's less important to drain in
that case, but whether worth avoiding the drain I don't know.

(On the general matter of these patches: mostly these days I find no
time to do better than let the memcg people go their own way.  I do
like these minimal patches much better than those putting blocks of
#ifdef'ed code into mm/page_io.c and mm/vmscan.c etc.  But we'll need
to see how badly suppressing readahead works out - as I've said
before, I'm no devout believer in readahead here, but have observed
in the past that it really does help.  I always thought that to handle
swapin readahead correctly, the memcg people would need to record the
cgs of what's on swap.)

Hugh


* Re: [PATCH 2/3] fix swap cache account leak at swapin-readahead
  2009-05-13 18:03           ` Hugh Dickins
@ 2009-05-14  0:05             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-14  0:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, linux-mm, balbir, nishimura, akpm, mingo, linux-kernel

Johannes, Hugh, thank you for your comments.


On Wed, 13 May 2009 19:03:47 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:

> On Wed, 13 May 2009, Johannes Weiner wrote:
> > On Wed, May 13, 2009 at 08:58:16AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 12 May 2009 13:24:00 +0200
> > > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > On Tue, May 12, 2009 at 10:46:03AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > >
> > > > > Index: mmotm-2.6.30-May07/mm/swap_state.c
> > > > > ===================================================================
> > > > > --- mmotm-2.6.30-May07.orig/mm/swap_state.c
> > > > > +++ mmotm-2.6.30-May07/mm/swap_state.c
> > > > > @@ -349,9 +349,9 @@ struct page *read_swap_cache_async(swp_e
> > > > >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > > > >  			struct vm_area_struct *vma, unsigned long addr)
> > > > >  {
> > > > > -	int nr_pages;
> > > > > +	int nr_pages = 1;
> > > > >  	struct page *page;
> > > > > -	unsigned long offset;
> > > > > +	unsigned long offset = 0;
> > > > >  	unsigned long end_offset;
> > > > >  
> > > > >  	/*
> > > > > @@ -360,8 +360,22 @@ struct page *swapin_readahead(swp_entry_
> > > > >  	 * No, it's very unlikely that swap layout would follow vma layout,
> > > > >  	 * more likely that neighbouring swap pages came from the same node:
> > > > >  	 * so use the same "addr" to choose the same node for each swap read.
> > > > > +	 *
> > > > > +	 * But, when memcg is used, swapin readahead give us some bad
> > > > > +	 * effects. There are 2 big problems in general.
> > > > > +	 * 1. Swapin readahead tend to use/read _not required_ memory.
> > > > > +	 *    And _not required_ memory is only freed by global LRU.
> > > > > +	 * 2. We can't charge pages for swap-cache readahead because
> > > > > +	 *    we should avoid account memory in a cgroup which a
> > > > > +	 *    thread call this function is not related to.
> > > > > +	 * And swapin-readahead have racy condition with
> > > > > +	 * free_swap_and_cache(). This also annoys memcg.
> > > > > +	 * Then, if memcg is really used, we avoid readahead.
> > > > >  	 */
> > > > > -	nr_pages = valid_swaphandles(entry, &offset);
> > > > > +
> > > > > +	if (!mem_cgroup_activated())
> > > > > +		nr_pages = valid_swaphandles(entry, &offset);
> > > > > +
> > > > >  	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
> > > > >  		/* Ok, do the async read-ahead now */
> > > > >  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> > > > 
> > > > Having nr_pages set to 1 and offset to zero will actually enter hat
> > > > loop and try to read a swap slot at offset zero, including a
> > > > superfluous page allocation, just to fail at the swap_duplicate()
> > > > (swap slot 0 is swap header -> SWAP_MAP_BAD).
> > > > 
> > > Hmm ?
> > >  swp_entry(swp_type(entry), offset),
> > > can be zero ?
> > 
> > I'm not sure I understand your question.
> 
> Nor I, but I think KAMEZAWA-san is suggesting that we never come here
> with offset 0 anyway.  Which I believe is correct.
> 
Sorry for the too-short text. Yes, I thought so.
 
> (And in passing, off topic, note that we have a problem if we ever
> do need to read page 0 in this way, in the swap-to-regular-file case: 
> because the swap_extents reading of page 0 can differ from sys_swapon's
> reading of the header page without swap_extents - possibly hibernate
> to swapfile can suffer from that, but not regular swapping paths.)
> 
> > Whether this whole
> > expression can or can not be zero is irrelevant.  My point is that you
> > enter the readahead loop with a bogus offset, while your original
> > intention is to completey disable readahead.
> 
> I don't really buy your point on offset 0 in particular: if offset 0
> is asked for, it goes through the intended motions, though you and I
> know that offset 0 will subsequently be found unsuitable; but it does
> what is asked of it.
> 
> However, I do agree with you that it's silly to be entering this loop
> at all when avoiding readahead.  When doing readahead, we have to cope
> with the fact that any of the calls in the loop might have failed, so
> we do the extra, targetted read_swap_cache_async at the end, to satisfy
> the actual request.  When avoiding readahead, better just to go to that
> final read_swap_cache_async, instead of duplicating it and compensating
> with a page_cache_release too.
> 
Ok, will do that.

> Which is what initializing nr_pages = 0 should achieve: see how
> valid_swaphandles() returns 0 rather than 1 when avoiding readahead,
> precisely to avoid the unnecessary duplication.  So I'd recommend
> nr_pages = 0 rather than nr_pages = 1 at the top.
> 

I see.

> > 
> > > > How about:
> > > > 
> > > > 	if (mem_cgroup_activated())
> > > > 		goto pivot;
> > > > 	nr_pages = valid_swaphandles(...);
> > > > 	for (readahead loop)
> > > > 		...
> > > > pivot:
> > > > 	return read_swap_cache_async();
> > > > 
> > > > That will also save you the runtime initialization of nr_pages and
> > > > offset completely when the cgroup is active.  And you'll have only one
> > > > branch and no second one for offset < end_offset in the loop.  And the
> > > > lru draining, but I'm not sure about that.  I think it's not needed.
> > > > 
> > > Hmm. I'm not sure why lru_add_drain()->read_swap_cache_async() is inserted before returing
> > > to caller. Is the page to be returned isn't necessary to be on LRU ?
> > 
> > I'm not sure either.  Neither the fault handler nor concurrent
> > swap-ins seem to care.  I added Hugh on CC.
> 
> Thanks, though you've probably got me from git-blame identifying when
> I moved that code around: the person you really want is akpm, then
> @digeo.com, in ChangeLog-2.5.46:
> 
> 	[PATCH] empty the deferred lru-addition buffers in swapin_readahead
> 	
> 	If we're about to return to userspace after performing some swap
> 	readahead, the pages in the deferred-addition LRU queues could stay
> 	there for some time.  So drain them after performing readahead.
> 
> I suspect that's a "seems like a good idea, especially if we've many cpus"
> (which I do agree with), rather than a practical finding in some workload.
> If we've read in a number of pages which quite possibly will prove of no
> use to anyone, better have them visible to reclaim on the LRU as soon as
> possible, rather than stuck indefinitely in per-cpu vectors.
> 
> The non-readahead case is a little different, in that you know the one
> page is really of use to someone; so it's less important to drain in
> that case, but whether worth avoiding the drain I don't know.
> 
Thank you very much! It seems lru_add_drain() is not necessary here.


> (On the general matter of these patches: mostly these days I find no
> time to do better than let the memcg people go their own way.  I do
> like these minimal patches much better than those putting blocks of
> #ifdef'ed code into mm/page_io.c and mm/vmscan.c etc.
And it was not perfect ;(

> But we'll need
> to see what how badly suppressing readahead works out - as I've said
> before, I'm no devout believer in readahead here, but have observed
> in the past that it really does help.  I always thought that to handle
> swapin readahead correctly, the memcg people would need to record the
> cgs of what's on swap.)
> 

IIUC, we record it. But to use that record correctly, we would have to "charge" newly
swapped-in pages which may never be used and may belong to another cgroup that
we are not part of. With a big modification, we might be able to charge at
add_to_swap_cache(). But in that case (when page_cluster=3, i.e. up to 8 readahead slots):

  swp_entry + 0  -> cg0
  swp_entry + 1  -> cg1
  ..............
  swp_entry + 7  -> cg7

We may have to charge 8 cgroups and call try_to_free_pages() in all 8 of them, with
a risk of OOM in other cgroups.

What I think would be ideal is, at swap_readahead():
==
  swp_entry + 0  -> special cg (or lru) for unused swap
  swp_entry + 1  -> special cg (or lru) for unused swap
  ...
  swp_entry + 7  -> special cg (or lru) for unused swap
==
And then move SwapCaches from the special cg to the real cg when they are mapped
(or added to shmem's address space).

But, for now, I can't find a good way to add a "special cg" without complicated races
and tons of code. (So a bugfix done this way would add more new bugs, I think.)

Thanks,
-Kame



* Re: [PATCH 0/3] fix stale swap cache account leak  in memcg v7
  2009-05-13  0:31     ` KAMEZAWA Hiroyuki
@ 2009-05-14 23:47       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-14 23:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, linux-mm, nishimura, akpm, mingo, linux-kernel

On Wed, 13 May 2009 09:31:27 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 12 May 2009 15:21:58 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > > The patch set includes followng
> > >  [1/3] add mem_cgroup_is_activated() function. which tell us memcg is _really_ used.
> > >  [2/3] fix swap cache handling race by avoidng readahead.
> > >  [3/3] fix swap cache handling race by check swapcount again.
> > > 
> > > Result is good under my test.
> > 
> > What was the result (performance data impact) of disabling swap
> > readahead? Otherwise, this looks the most reasonable set of patches
> > for this problem.
> > 
> I'll measure some and report it in the next post.
> 
I confirmed there are cases in which swapin readahead works very well.

Nishimura-san, could you post a patch fixing the leak at writeback as [3/3]?
I'd like to fix the readahead case with a somewhat larger patch.

Hm, I didn't think this problem would take 2 months to fix ;(

-Kame



* Re: [PATCH 0/3] fix stale swap cache account leak  in memcg v7
  2009-05-14 23:47       ` KAMEZAWA Hiroyuki
@ 2009-05-15  0:38         ` Daisuke Nishimura
  -1 siblings, 0 replies; 52+ messages in thread
From: Daisuke Nishimura @ 2009-05-15  0:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, balbir, linux-mm, akpm, mingo, linux-kernel

On Fri, 15 May 2009 08:47:16 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 13 May 2009 09:31:27 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Tue, 12 May 2009 15:21:58 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > > The patch set includes followng
> > > >  [1/3] add mem_cgroup_is_activated() function. which tell us memcg is _really_ used.
> > > >  [2/3] fix swap cache handling race by avoidng readahead.
> > > >  [3/3] fix swap cache handling race by check swapcount again.
> > > > 
> > > > Result is good under my test.
> > > 
> > > What was the result (performance data impact) of disabling swap
> > > readahead? Otherwise, this looks the most reasonable set of patches
> > > for this problem.
> > > 
> > I'll measure some and report it in the next post.
> > 
> I confirmed there are cases which swapin readahead works very well....
> 
> Nishimura-san, could you post a patch for fixing leak at writeback ? as [3/3]
> I'd like to fix readahead case...with some large patch.
> 
Sure.
I'll rebase my patch onto [1-2/3] of your new patch and post it.

> Hm, I didn't think this problem took 2 months to be fixed ;(
> 
I didn't think so either.


Thanks,
Daisuke Nishimura.


* Re: [PATCH 0/3] fix stale swap cache account leak  in memcg v7
  2009-05-15  0:38         ` Daisuke Nishimura
@ 2009-05-15  0:54           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 52+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-15  0:54 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: balbir, linux-mm, akpm, mingo, linux-kernel

On Fri, 15 May 2009 09:38:53 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 15 May 2009 08:47:16 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Wed, 13 May 2009 09:31:27 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Tue, 12 May 2009 15:21:58 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > > The patch set includes followng
> > > > >  [1/3] add mem_cgroup_is_activated() function. which tell us memcg is _really_ used.
> > > > >  [2/3] fix swap cache handling race by avoidng readahead.
> > > > >  [3/3] fix swap cache handling race by check swapcount again.
> > > > > 
> > > > > Result is good under my test.
> > > > 
> > > > What was the result (performance data impact) of disabling swap
> > > > readahead? Otherwise, this looks the most reasonable set of patches
> > > > for this problem.
> > > > 
> > > I'll measure some and report it in the next post.
> > > 
> > I confirmed there are cases which swapin readahead works very well....
> > 
> > Nishimura-san, could you post a patch for fixing leak at writeback ? as [3/3]
> > I'd like to fix readahead case...with some large patch.
> > 
> Sure.
> I'll rebase my patch onto [1-2/3] of your new patch and post it.
> 
Ah, please go ahead and don't wait for me. Mine is still at the rough design stage.

Thanks,
-Kame



* Re: [PATCH 0/3] fix stale swap cache account leak  in memcg v7
  2009-05-15  0:54           ` KAMEZAWA Hiroyuki
@ 2009-05-15  1:12             ` Daisuke Nishimura
  -1 siblings, 0 replies; 52+ messages in thread
From: Daisuke Nishimura @ 2009-05-15  1:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, balbir, linux-mm, akpm, mingo, linux-kernel

On Fri, 15 May 2009 09:54:45 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 15 May 2009 09:38:53 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Fri, 15 May 2009 08:47:16 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Wed, 13 May 2009 09:31:27 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Tue, 12 May 2009 15:21:58 +0530
> > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > > The patch set includes followng
> > > > > >  [1/3] add mem_cgroup_is_activated() function. which tell us memcg is _really_ used.
> > > > > >  [2/3] fix swap cache handling race by avoidng readahead.
> > > > > >  [3/3] fix swap cache handling race by check swapcount again.
> > > > > > 
> > > > > > Result is good under my test.
> > > > > 
> > > > > What was the result (performance data impact) of disabling swap
> > > > > readahead? Otherwise, this looks the most reasonable set of patches
> > > > > for this problem.
> > > > > 
> > > > I'll measure some and report it in the next post.
> > > > 
> > > I confirmed there are cases which swapin readahead works very well....
> > > 
> > > Nishimura-san, could you post a patch for fixing leak at writeback ? as [3/3]
> > > I'd like to fix readahead case...with some large patch.
> > > 
> > Sure.
> > I'll rebase my patch onto [1-2/3] of your new patch and post it.
> > 
> Ah, plz go ahead and don't wait for me. Mine is just under rough design now.
> 
I see.

Thanks,
Daisuke Nishimura.


end of thread, other threads:[~2009-05-15  1:17 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-12  1:44 [PATCH 0/3] fix stale swap cache account leak in memcg v7 KAMEZAWA Hiroyuki
2009-05-12  1:44 ` KAMEZAWA Hiroyuki
2009-05-12  1:45 ` [PATCH 1/3] add check for mem cgroup is activated KAMEZAWA Hiroyuki
2009-05-12  1:45   ` KAMEZAWA Hiroyuki
2009-05-12  1:46 ` [PATCH 2/3] fix swap cache account leak at swapin-readahead KAMEZAWA Hiroyuki
2009-05-12  1:46   ` KAMEZAWA Hiroyuki
2009-05-12  4:32   ` Daisuke Nishimura
2009-05-12  4:32     ` Daisuke Nishimura
2009-05-12 11:24   ` Johannes Weiner
2009-05-12 11:24     ` Johannes Weiner
2009-05-12 23:58     ` KAMEZAWA Hiroyuki
2009-05-12 23:58       ` KAMEZAWA Hiroyuki
2009-05-13 11:18       ` Johannes Weiner
2009-05-13 11:18         ` Johannes Weiner
2009-05-13 18:03         ` Hugh Dickins
2009-05-13 18:03           ` Hugh Dickins
2009-05-14  0:05           ` KAMEZAWA Hiroyuki
2009-05-14  0:05             ` KAMEZAWA Hiroyuki
2009-05-12  1:47 ` [PATCH 3/3] fix stale swap cache at writeback KAMEZAWA Hiroyuki
2009-05-12  1:47   ` KAMEZAWA Hiroyuki
2009-05-12  5:06 ` [PATCH 4/3] memcg: call uncharge_swapcache outside of tree_lock (Re: [PATCH 0/3] fix stale swap cache account leak in memcg v7) Daisuke Nishimura
2009-05-12  5:06   ` Daisuke Nishimura
2009-05-12  7:09   ` KAMEZAWA Hiroyuki
2009-05-12  7:09     ` KAMEZAWA Hiroyuki
2009-05-12  8:00     ` Daisuke Nishimura
2009-05-12  8:00       ` Daisuke Nishimura
2009-05-12  8:13       ` [PATCH][BUGFIX] memcg: fix for deadlock between lock_page_cgroup and mapping tree_lock KAMEZAWA Hiroyuki
2009-05-12  8:13         ` KAMEZAWA Hiroyuki
2009-05-12 10:58         ` Daisuke Nishimura
2009-05-12 10:58           ` Daisuke Nishimura
2009-05-12 23:59           ` KAMEZAWA Hiroyuki
2009-05-12 23:59             ` KAMEZAWA Hiroyuki
2009-05-13  0:28             ` Daisuke Nishimura
2009-05-13  0:28               ` Daisuke Nishimura
2009-05-13  0:32               ` KAMEZAWA Hiroyuki
2009-05-13  0:32                 ` KAMEZAWA Hiroyuki
2009-05-13  3:55                 ` KAMEZAWA Hiroyuki
2009-05-13  3:55                   ` KAMEZAWA Hiroyuki
2009-05-13  4:11                   ` nishimura
2009-05-13  4:11                     ` nishimura
2009-05-12  9:51 ` [PATCH 0/3] fix stale swap cache account leak in memcg v7 Balbir Singh
2009-05-12  9:51   ` Balbir Singh
2009-05-13  0:31   ` KAMEZAWA Hiroyuki
2009-05-13  0:31     ` KAMEZAWA Hiroyuki
2009-05-14 23:47     ` KAMEZAWA Hiroyuki
2009-05-14 23:47       ` KAMEZAWA Hiroyuki
2009-05-15  0:38       ` Daisuke Nishimura
2009-05-15  0:38         ` Daisuke Nishimura
2009-05-15  0:54         ` KAMEZAWA Hiroyuki
2009-05-15  0:54           ` KAMEZAWA Hiroyuki
2009-05-15  1:12           ` Daisuke Nishimura
2009-05-15  1:12             ` Daisuke Nishimura

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.