* [RFC] memcg: handle swapcache leak
@ 2009-03-17  4:57 Daisuke Nishimura
  2009-03-17  5:39 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-17  4:57 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Hugh Dickins, Daisuke Nishimura

Hi.

There are (at least) two types (described below) of swapcache leak in the current memcg.

By "swapcache leak" I mean a swapcache page for which:
  a. the process that used the page has already exited (or unmapped
     the page), and
  b. the page is not linked to any memcg's LRU because it is !PageCgroupUsed.

So only global page reclaim or swapoff can free these leaked swapcaches.
This means memcg's memory pressure can use up all swap entries if
the memory size of the system is greater than that of swap.

1. race between exit and swap-in
  Assume processA is exiting and processB is doing swap-in.

  If some pages of processA have been swapped out, it calls free_swap_and_cache().
  If, at the same time, processB is calling read_swap_cache_async() on
  a swap entry *that is used by processA*, a race like the one below can happen.

            processA                   |           processB
  -------------------------------------+-------------------------------------
    (free_swap_and_cache())            |  (read_swap_cache_async())
                                       |    swap_duplicate()
                                       |    __set_page_locked()
                                       |    add_to_swap_cache()
      swap_entry_free() == 0           |
      find_get_page() -> found         |
      try_lock_page() -> fail & return |
                                       |    lru_cache_add_anon()
                                       |      doesn't link this page to memcg's
                                       |      LRU, because of !PageCgroupUsed.

  This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.
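
  For reference, the page is not linked at the last step above because
  mem_cgroup_add_lru_list() returns early for !PageCgroupUsed pages.
  Roughly (an abridged sketch, not the exact function):

===
void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
{
	struct page_cgroup *pc;
	struct mem_cgroup_per_zone *mz;

	if (mem_cgroup_disabled())
		return;
	pc = lookup_page_cgroup(page);
	/* barrier so that pc->mem_cgroup is visible */
	smp_rmb();
	if (!PageCgroupUsed(pc))
		return;		/* the racy swapcache above ends up here */

	mz = page_cgroup_zoneinfo(pc);
	MEM_CGROUP_ZSTAT(mz, lru) += 1;
	list_add(&pc->lru, &mz->lists[lru]);
}
===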

  Swapcaches leaked this way have been charged as swap, so their swap
  entries hold a reference to the associated memcg and the memcg's refcnt
  has been incremented.
  As a result, this memcg cannot be freed until global page reclaim
  frees the swapcache or swapoff is executed.

  Actually, I saw a "struct mem_cgroup" leak (checked by "grep kmalloc-1024 /proc/slabinfo")
  in my test, where I create a new directory, move all tasks to the new
  directory, and remove the old directory under memcg's memory pressure.
  This "struct mem_cgroup" leak didn't happen when
  /proc/sys/vm/page-cluster was set to 0.
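
  A minimal sketch of that test procedure (the /cgroup mount point and the
  directory names are hypothetical; the memory pressure itself comes from
  memory hogs running inside the group):

===
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	FILE *old_tasks, *new_tasks;
	char pid[32];

	mkdir("/cgroup/new", 0755);
	old_tasks = fopen("/cgroup/old/tasks", "r");
	new_tasks = fopen("/cgroup/new/tasks", "w");
	if (!old_tasks || !new_tasks)
		return 1;
	while (fgets(pid, sizeof(pid), old_tasks)) {
		fputs(pid, new_tasks);	/* one write per pid moves the task */
		fflush(new_tasks);
	}
	fclose(old_tasks);
	fclose(new_tasks);
	/*
	 * rmdir() succeeds once the group is empty, but the struct
	 * mem_cgroup can stay pinned; that is what shows up in
	 * "grep kmalloc-1024 /proc/slabinfo".
	 */
	return rmdir("/cgroup/old");
}
===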

2. race between exit and swap-out
  If page_remove_rmap() is called by the owner process on an anonymous
  page (not yet on swapcache, so it is uncharged at this point) before
  shrink_page_list() adds the page to swapcache, the page becomes a
  swapcache page with !PageCgroupUsed.

  And if this swapcache is not freed by shrink_page_list(), it goes back
  to the global LRU, but not to the memcg's LRU, because the page is
  !PageCgroupUsed.

  This type of leak can be avoided by modifying shrink_page_list() like this:

===
@@ -775,6 +776,21 @@ activate_locked:
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
+		if (!scanning_global_lru(sc) && PageSwapCache(page)) {
+			struct page_cgroup *pc;
+
+			pc = lookup_page_cgroup(page);
+			/*
+			 * Used bit of swapcache is solid under page lock.
+			 */
+			if (unlikely(!PageCgroupUsed(pc)))
+				/*
+				 * This can happen if the page is unmapped by
+				 * the owner process before it is added to
+				 * swapcache.
+				 */
+				try_to_free_swap(page);
+		}
 		unlock_page(page);
 keep:
 		list_add(&page->lru, &ret_pages);
===


I've confirmed that no leak happens in a simple swap in/out test with the above
shrink_page_list() patch applied and /proc/sys/vm/page-cluster set to 0.
(I think I should check page migration and rmdir too.)
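
For what it's worth, a minimal sketch of such a swap in/out test (the size and
pass count are assumptions, not my exact test): run it inside a memcg whose
memory.limit_in_bytes is well below SIZE, so pages keep being swapped out and
faulted back in.

===
#include <stdio.h>
#include <stdlib.h>

#define SIZE	(64UL << 20)	/* larger than the memcg limit */
#define PASSES	100

int main(void)
{
	char *buf = malloc(SIZE);
	unsigned long i, pass;

	if (!buf) {
		perror("malloc");
		return 1;
	}
	for (pass = 0; pass < PASSES; pass++)
		for (i = 0; i < SIZE; i += 4096)
			buf[i] = (char)pass;	/* touch every page */
	free(buf);
	return 0;
}
===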

I think the root cause of these problems is that !PageCgroupUsed pages are not linked
to any memcg's LRU.
So I'm now trying to implement a "dummy_memcg" to maintain !PageCgroupUsed pages.

Any comments or suggestions would be welcome.


Thanks,
Daisuke Nishimura.


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  4:57 [RFC] memcg: handle swapcache leak Daisuke Nishimura
@ 2009-03-17  5:39 ` KAMEZAWA Hiroyuki
  2009-03-17  6:11   ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17  5:39 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 13:57:02 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Hi.
> 
> There are (at least) 2 types(described later) of swapcache leak in current memcg.
> 
> I mean by "swapcache leak" a swapcache which:
>   a. the process that used the page has already exited(or
>      unmapped the page).
>   b. is not linked to memcg's LRU because the page is !PageCgroupUsed.
> 
> So, only the global page reclaim or swapoff can free these leaked swapcaches.
> This means memcg's memory pressure can use up all swap entries if
> the memory size of the system is greater than that of swap.
> 
> 1. race between exit and swap-in
>   Assume processA is exitting and processB is doing swap-in.
> 
>   If some pages of processA has been swapped out, it calls free_swap_and_cache().
>   And if at the same time, processB is calling read_swap_cache_async() about
>   a swap entry *that is used by processA*, a race like below can happen.
> 
>             processA                   |           processB
>   -------------------------------------+-------------------------------------
>     (free_swap_and_cache())            |  (read_swap_cache_async())
>                                        |    swap_duplicate()
>                                        |    __set_page_locked()
>                                        |    add_to_swap_cache()
>       swap_entry_free() == 0           |
                          == 1?
>       find_get_page() -> found         |
>       try_lock_page() -> fail & return |
>                                        |    lru_cache_add_anon()
>                                        |      doesn't link this page to memcg's
>                                        |      LRU, because of !PageCgroupUsed.
> 
>   This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.
> 
>   And this type of leaked swapcaches have been charged as swap,
>   so swap entries of them have reference to the associated memcg
>   and the refcnt of the memcg has been incremented.
>   As a result this memcg cannot be free'ed until global page reclaim
>   frees this swapcache or swapoff is executed.
> 
Okay, this can happen.

>   Actually, I saw "struct mem_cgroup leak"(checked by "grep kmalloc-1024 /proc/slabinfo")
>   in my test, where I create a new directory, move all tasks to the new
>   directory, and remove the old directory under memcg's memory pressure.
>   And, this "struct mem_cgroup leak" didn't happen with setting
>   /proc/sys/vm/page-cluster to 0.
> 

Hmm, but IMHO, this is not a "leak". A "leak" means the object will never be freed.
This is a "delay".

And I tend to allow this. (A stale SwapCache page will stay on the LRU until the
global LRU scan finds it, but that's not what I'd call a leak.)



> 2. race between exit and swap-out
>   If page_remove_rmap() is called by the owner process about an anonymous
>   page(not on swapchache, so uncharged here) before shrink_page_list() adds
>   the page to swapcache, this page becomes a swapcache with !PageCgroupUsed.
> 
>   And if this swapcache is not free'ed by shrink_page_list(), it goes back
>   to global LRU, but doesn't go back to memcg's LRU because the page is
>   !PageCgroupUsed.
> 
>   This type of leak can be avoided by modifying shrink_page_list() like:
> 
> ===
> @@ -775,6 +776,21 @@ activate_locked:
>  		SetPageActive(page);
>  		pgactivate++;
>  keep_locked:
> +		if (!scanning_global_lru(sc) && PageSwapCache(page)) {
> +			struct page_cgroup *pc;
> +
> +			pc = lookup_page_cgroup(page);
> +			/*
> +			 * Used bit of swapcache is solid under page lock.
> +			 */
> +			if (unlikely(!PageCgroupUsed(pc)))
> +				/*
> +				 * This can happen if the page is unmapped by
> +				 * the owner process before it is added to
> +				 * swapcache.
> +				 */
> +				try_to_free_swap(page);
> +		}
>  		unlock_page(page);
>  keep:
>  		list_add(&page->lru, &ret_pages);
> ===
> 
> 
> I've confirmed that no leak happens with this patch for shrink_page_list() applied
> and setting /proc/sys/vm/page-cluster to 0 in a simple swap in/out test.
> (I think I should check page migration and rmdir too.)
> 

But this is also a "delay", isn't it ?

I think both "delays" come from the nature of the current LRU design, which allows
small windows of this kind. But there is no "leak".

IMHO, I tend to allow this kind of "delay" considering the trade-off.

I have no problem as long as rmdir() can succeed.

Thanks,
-Kame


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  5:39 ` KAMEZAWA Hiroyuki
@ 2009-03-17  6:11   ` Daisuke Nishimura
  2009-03-17  7:29     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-17  6:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 14:39:03 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 17 Mar 2009 13:57:02 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > Hi.
> > 
> > There are (at least) 2 types(described later) of swapcache leak in current memcg.
> > 
> > I mean by "swapcache leak" a swapcache which:
> >   a. the process that used the page has already exited(or
> >      unmapped the page).
> >   b. is not linked to memcg's LRU because the page is !PageCgroupUsed.
> > 
> > So, only the global page reclaim or swapoff can free these leaked swapcaches.
> > This means memcg's memory pressure can use up all swap entries if
> > the memory size of the system is greater than that of swap.
> > 
> > 1. race between exit and swap-in
> >   Assume processA is exitting and processB is doing swap-in.
> > 
> >   If some pages of processA has been swapped out, it calls free_swap_and_cache().
> >   And if at the same time, processB is calling read_swap_cache_async() about
> >   a swap entry *that is used by processA*, a race like below can happen.
> > 
> >             processA                   |           processB
> >   -------------------------------------+-------------------------------------
> >     (free_swap_and_cache())            |  (read_swap_cache_async())
> >                                        |    swap_duplicate()
> >                                        |    __set_page_locked()
> >                                        |    add_to_swap_cache()
> >       swap_entry_free() == 0           |
>                           == 1?
> >       find_get_page() -> found         |
> >       try_lock_page() -> fail & return |
> >                                        |    lru_cache_add_anon()
> >                                        |      doesn't link this page to memcg's
> >                                        |      LRU, because of !PageCgroupUsed.
> > 
> >   This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.
> > 
> >   And this type of leaked swapcaches have been charged as swap,
> >   so swap entries of them have reference to the associated memcg
> >   and the refcnt of the memcg has been incremented.
> >   As a result this memcg cannot be free'ed until global page reclaim
> >   frees this swapcache or swapoff is executed.
> > 
> Okay. can happen.
> 
> >   Actually, I saw "struct mem_cgroup leak"(checked by "grep kmalloc-1024 /proc/slabinfo")
> >   in my test, where I create a new directory, move all tasks to the new
> >   directory, and remove the old directory under memcg's memory pressure.
> >   And, this "struct mem_cgroup leak" didn't happen with setting
> >   /proc/sys/vm/page-cluster to 0.
> > 
> 
> Hmm, but IHMO, this is not "leak". "leak" means the object will not be freed forever.
> This is a "delay".
> 
> And I tend to allow this. (stale SwapCache will be on LRU until global LRU found it,
> but it's not called leak.)
> 
You're right, but memcg's reclaim doesn't scan the global LRU,
so these swapcaches cannot be freed by memcg's reclaim.

This means that a system under memcg's memory pressure but without
global memory pressure can use up swap space as swapcaches, doesn't it ?
That's what I'm worrying about.
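
(For reference, a minimal way to watch this from userspace; a sketch, not part
of any fix. If stale swapcaches keep piling up, free swap keeps shrinking even
though no cgroup is anywhere near its memsw limit.)

===
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
	struct sysinfo si;

	if (sysinfo(&si))
		return 1;
	printf("swap: %lu kB total, %lu kB free\n",
	       si.totalswap * si.mem_unit / 1024,
	       si.freeswap * si.mem_unit / 1024);
	return 0;
}
===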


Thanks,
Daisuke Nishimura.

> 
> 
> > 2. race between exit and swap-out
> >   If page_remove_rmap() is called by the owner process about an anonymous
> >   page(not on swapchache, so uncharged here) before shrink_page_list() adds
> >   the page to swapcache, this page becomes a swapcache with !PageCgroupUsed.
> > 
> >   And if this swapcache is not free'ed by shrink_page_list(), it goes back
> >   to global LRU, but doesn't go back to memcg's LRU because the page is
> >   !PageCgroupUsed.
> > 
> >   This type of leak can be avoided by modifying shrink_page_list() like:
> > 
> > ===
> > @@ -775,6 +776,21 @@ activate_locked:
> >  		SetPageActive(page);
> >  		pgactivate++;
> >  keep_locked:
> > +		if (!scanning_global_lru(sc) && PageSwapCache(page)) {
> > +			struct page_cgroup *pc;
> > +
> > +			pc = lookup_page_cgroup(page);
> > +			/*
> > +			 * Used bit of swapcache is solid under page lock.
> > +			 */
> > +			if (unlikely(!PageCgroupUsed(pc)))
> > +				/*
> > +				 * This can happen if the page is unmapped by
> > +				 * the owner process before it is added to
> > +				 * swapcache.
> > +				 */
> > +				try_to_free_swap(page);
> > +		}
> >  		unlock_page(page);
> >  keep:
> >  		list_add(&page->lru, &ret_pages);
> > ===
> > 
> > 
> > I've confirmed that no leak happens with this patch for shrink_page_list() applied
> > and setting /proc/sys/vm/page-cluster to 0 in a simple swap in/out test.
> > (I think I should check page migration and rmdir too.)
> > 
> 
> But this is also "delay", isn't it ?
> 
> I think both "delay" comes from nature of current LRU desgin which allows small window
> of this kinds. But there is no "leak". 
> 
> IMHO, I tend to allow this kinds of "delay" considering trade-off.
> 
> I have no troubles if rmdir() can success.
> 
> Thanks,
> -Kame
> 


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  6:11   ` Daisuke Nishimura
@ 2009-03-17  7:29     ` KAMEZAWA Hiroyuki
  2009-03-17  9:38       ` KAMEZAWA Hiroyuki
  2009-03-18  0:08       ` [RFC] memcg: handle swapcache leak Daisuke Nishimura
  0 siblings, 2 replies; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17  7:29 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 15:11:13 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:


> > Hmm, but IHMO, this is not "leak". "leak" means the object will not be freed forever.
> > This is a "delay".
> > 
> > And I tend to allow this. (stale SwapCache will be on LRU until global LRU found it,
> > but it's not called leak.)
> > 
> You're right, but memcg's reclaim doesn't scan global LRU,
> so these swapcaches cannot be free'ed by memcg's reclaim.
> 
right.

> This means that a system with memcg's memory pressure but without
> global memory pressure can use up swap space as swapcaches, doesn't it ?
> That's what I'm worrying about.
> 
This kind of behavior (don't add to the memcg LRU if !PageCgroupUsed()) is for swapin-readahead.
We need this behavior.

We have never seen swap exhausted by this issue.....but yes, the probability is not 0%.

Without memcg, when a page is added to swap, the global LRU runs anyway.
With memcg, when a page is added to swap, the global LRU may not run.

Give me time, I'll find a fix.

Thanks,
-Kame


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  7:29     ` KAMEZAWA Hiroyuki
@ 2009-03-17  9:38       ` KAMEZAWA Hiroyuki
  2009-03-18  1:17         ` Daisuke Nishimura
  2009-03-18  0:08       ` [RFC] memcg: handle swapcache leak Daisuke Nishimura
  1 sibling, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17  9:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 16:29:50 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Give me time, I'll find a fix.
> 
Following is the result of a quick hack, *not* tested.

But how does this look ? (please ignore the garbage..)

==
---
 include/linux/page_cgroup.h |   13 ++++
 mm/memcontrol.c             |  140 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 146 insertions(+), 7 deletions(-)

Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+        { return test_and_set_bit(PCG_##lname, &pc->flags);}
+
+#define TESTCLEARPCGFLAG(uname, lname) \
+static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
+        { return test_and_clear_bit(PCG_##lname, &pc->flags);}
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTSETPCGFLAG(Orphan, ORPHAN)
+TESTCLEARPCGFLAG(Orphan, ORPHAN)
+
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar13/mm/memcontrol.c
@@ -204,11 +204,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 };
 
 /* for encoding cft->private value on file */
-#define _MEM			(0)
-#define _MEMSWAP		(1)
-#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
-#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
-#define MEMFILE_ATTR(val)	((val) & 0xffff)
+#define _MEM                   (0)
+#define _MEMSWAP               (1)
+#define MEMFILE_PRIVATE(x, val)        (((x) << 16) | (val))
+#define MEMFILE_TYPE(val)      (((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val)      ((val) & 0xffff)
+
+/* for orphan page_cgroups, guarded by zone->lock. */
+struct orphan_pcg_list {
+	struct list_head zone[MAX_NR_ZONES];
+};
+struct orphan_pcg_list *orphan_list[MAX_NUMNODES];
+atomic_t num_orphan_pages;
+
+static inline struct list_head *orphan_lru(int nid, int zid)
+{
+	/*
+	 * to kick this BUG_ON(), swapcache must be generated while init.
+	 * or NID should be invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
 
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
@@ -380,6 +398,14 @@ void mem_cgroup_del_lru_list(struct page
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on global LRU, it will be on
+	 * orphan list. remove here
+	 */
+	if (unlikely(PageSwapCache(page) && TestClearPageCgroupOrphan(pc))) {
+		list_del_init(&pc->lru);
+		atomic_dec(&num_orphan_pages);
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -414,7 +440,7 @@ void mem_cgroup_rotate_lru_list(struct p
 	 */
 	smp_rmb();
 	/* unused page is not rotated. */
-	if (!PageCgroupUsed(pc))
+	if (unlikely(!PageCgroupUsed(pc)))
 		return;
 	mz = page_cgroup_zoneinfo(pc);
 	list_move(&pc->lru, &mz->lists[lru]);
@@ -433,8 +459,15 @@ void mem_cgroup_add_lru_list(struct page
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
+	if (unlikely(!PageCgroupUsed(pc))) {
+		if (PageSwapCache(page) && !TestSetPageCgroupOrphan(pc)) {
+			struct list_head *lru;
+			lru = orphan_lru(page_to_nid(page), page_zonenum(page));
+			list_add_tail(&pc->lru, lru);
+			atomic_inc(&num_orphan_pages);
+		}
 		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -784,6 +817,95 @@ static int mem_cgroup_count_children(str
 	return num;
 }
 
+
+
+/* Using big number here for avoiding to free swap-cache of readahead. */
+#define CHECK_ORPHAN_THRESH  (4096)
+
+static __init void init_orphan_lru(void)
+{
+	struct orphan_pcg_list *opl;
+	int nid, zid;
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		opl = kmalloc(sizeof(struct orphan_pcg_list),  GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			INIT_LIST_HEAD(&opl->zone[zid]);
+		orphan_list[nid] = opl;
+	}
+}
+/* 
+ * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
+ * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
+ */
+static int drain_orphan_swapcaches(int nid, int zid)
+{
+	struct page_cgroup *pc;
+	struct zone *zone;
+	struct page *page;
+	struct list_head *lru = orphan_lru(nid, zid);
+	unsigned long flags;
+	int drain, scan;
+
+	zone = &NODE_DATA(nid)->node_zones[zid];
+	/* check one by one */
+	scan = 0;
+	drain = 0;
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	while (!list_empty(lru) && (scan < SWAP_CLUSTER_MAX*2)) {
+		scan++;
+		pc = list_entry(lru->next, struct page_cgroup, lru);
+		page = pc->page;
+		/* Rotate */
+		list_del(&pc->lru);
+		list_add_tail(&pc->lru, lru);
+		/* get page for isolate_lru_page() */
+		if (get_page_unless_zero(page)) {
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			if (!isolate_lru_page(page)) {
+				/* This page is not ON LRU */
+				if (trylock_page(page)) {
+					drain += try_to_free_swap(page);
+					unlock_page(page);
+				}
+				putback_lru_page(page);
+			}
+			put_page(page);
+			spin_lock_irqsave(&zone->lru_lock, flags);		
+		}
+	}
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+	return drain;
+}
+
+static int last_visit;
+void check_stale_swapcaches(void)
+{
+	int nid, zid, drain;
+	
+	nid = last_visit;
+	drain = 0;
+	
+	if (atomic_read(&num_orphan_pages) < CHECK_ORPHAN_THRESH)
+		return;
+		
+again:
+	nid = next_node(nid, node_states[N_HIGH_MEMORY]);
+	if (nid == MAX_NUMNODES) {
+		nid = 0;
+		if (!node_state(nid, N_HIGH_MEMORY))
+			goto again;
+	}
+	last_visit = nid;
+
+	for (zid = 0; !drain && zid < MAX_NR_ZONES; zid++)
+		drain += drain_orphan_swapcaches(nid, zid);
+}
+
+
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -842,6 +964,9 @@ static int mem_cgroup_hierarchical_recla
 	int ret, total = 0;
 	int loop = 0;
 
+	if (vm_swap_full())
+		check_stale_swapcaches();
+
 	while (loop < 2) {
 		victim = mem_cgroup_select_victim(root_mem);
 		if (victim == root_mem)
@@ -2454,6 +2579,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  7:29     ` KAMEZAWA Hiroyuki
  2009-03-17  9:38       ` KAMEZAWA Hiroyuki
@ 2009-03-18  0:08       ` Daisuke Nishimura
  1 sibling, 0 replies; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-18  0:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 16:29:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 17 Mar 2009 15:11:13 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> 
> > > Hmm, but IHMO, this is not "leak". "leak" means the object will not be freed forever.
> > > This is a "delay".
> > > 
> > > And I tend to allow this. (stale SwapCache will be on LRU until global LRU found it,
> > > but it's not called leak.)
> > > 
> > You're right, but memcg's reclaim doesn't scan global LRU,
> > so these swapcaches cannot be free'ed by memcg's reclaim.
> > 
> right.
> 
> > This means that a system with memcg's memory pressure but without
> > global memory pressure can use up swap space as swapcaches, doesn't it ?
> > That's what I'm worrying about.
> > 
> This kind of behavior (don't add to LRU if !PageCgroupUsed()) is for swapin-readahead.
> We need this hebavior.
> 
> We never see the swap is exhausted by this issue .....but yes, not 0%.
> 
Just FYI.
I ran 5 programs last night, each using 8MB, with mem.limit=32M
and 30MB of swap on the system.
All swap space was used up by swapcache and some programs were oom'ed.


Thanks,
Daisuke Nishimura.

> Without memcg, when the page is added to swap, global LRU runs, anyway.
> With memcg, when the page is added to swap, global LRU will not runs.
> 
> Give me time, I'll find a fix.
> 
> Thanks,
> -Kame
> 


* Re: [RFC] memcg: handle swapcache leak
  2009-03-17  9:38       ` KAMEZAWA Hiroyuki
@ 2009-03-18  1:17         ` Daisuke Nishimura
  2009-03-18  1:34           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-18  1:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 17 Mar 2009 18:38:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 17 Mar 2009 16:29:50 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Give me time, I'll find a fix.
> > 
> Fowlling is  a result of quick hack, *not* tested.
> 
> but how this looks ? (please ignore garbage..)
> 
I agree with managing special per-zone lists to handle these swapcaches.

A few comments are below.

> ==
> ---
>  include/linux/page_cgroup.h |   13 ++++
>  mm/memcontrol.c             |  140 +++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 146 insertions(+), 7 deletions(-)
> 
> Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
>  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname) \
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> +        { return test_and_set_bit(PCG_##lname, &pc->flags);}
> +
> +#define TESTCLEARPCGFLAG(uname, lname) \
> +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> +        { return test_and_clear_bit(PCG_##lname, &pc->flags);}
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTSETPCGFLAG(Orphan, ORPHAN)
> +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> +
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> @@ -204,11 +204,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
>  };
>  
>  /* for encoding cft->private value on file */
> -#define _MEM			(0)
> -#define _MEMSWAP		(1)
> -#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> -#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
> -#define MEMFILE_ATTR(val)	((val) & 0xffff)
> +#define _MEM                   (0)
> +#define _MEMSWAP               (1)
> +#define MEMFILE_PRIVATE(x, val)        (((x) << 16) | (val))
> +#define MEMFILE_TYPE(val)      (((val) >> 16) & 0xffff)
> +#define MEMFILE_ATTR(val)      ((val) & 0xffff)
> +
> +/* for orphan page_cgroups, guarded by zone->lock. */
> +struct orphan_pcg_list {
> +	struct list_head zone[MAX_NR_ZONES];
> +};
> +struct orphan_pcg_list *orphan_list[MAX_NUMNODES];
> +atomic_t num_orphan_pages;
> +
> +static inline struct list_head *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * to kick this BUG_ON(), swapcache must be generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
>  
>  static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
> @@ -380,6 +398,14 @@ void mem_cgroup_del_lru_list(struct page
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageSwapCache(page) && TestClearPageCgroupOrphan(pc))) {
> +		list_del_init(&pc->lru);
> +		atomic_dec(&num_orphan_pages);
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
Is this "PageSwapCache(page)" check needed ?

What happens, for example, if a swapcache page which has been swapped in by
readahead and has not been mapped by the owner process is zapped by that process ?
IIUC, free_swap_and_cache() removes the page from the swapcache before the page
is removed from the LRU.

> @@ -414,7 +440,7 @@ void mem_cgroup_rotate_lru_list(struct p
>  	 */
>  	smp_rmb();
>  	/* unused page is not rotated. */
> -	if (!PageCgroupUsed(pc))
> +	if (unlikely(!PageCgroupUsed(pc)))
>  		return;
>  	mz = page_cgroup_zoneinfo(pc);
>  	list_move(&pc->lru, &mz->lists[lru]);
> @@ -433,8 +459,15 @@ void mem_cgroup_add_lru_list(struct page
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> +	if (unlikely(!PageCgroupUsed(pc))) {
> +		if (PageSwapCache(page) && !TestSetPageCgroupOrphan(pc)) {
> +			struct list_head *lru;
> +			lru = orphan_lru(page_to_nid(page), page_zonenum(page));
> +			list_add_tail(&pc->lru, lru);
> +			atomic_inc(&num_orphan_pages);
> +		}
>  		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> @@ -784,6 +817,95 @@ static int mem_cgroup_count_children(str
>  	return num;
>  }
>  
> +
> +
> +/* Using big number here for avoiding to free swap-cache of readahead. */
> +#define CHECK_ORPHAN_THRESH  (4096)
> +
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_pcg_list *opl;
> +	int nid, zid;
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		opl = kmalloc(sizeof(struct orphan_pcg_list),  GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			INIT_LIST_HEAD(&opl->zone[zid]);
> +		orphan_list[nid] = opl;
> +	}
> +}
> +/* 
> + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> + */
> +static int drain_orphan_swapcaches(int nid, int zid)
> +{
> +	struct page_cgroup *pc;
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *lru = orphan_lru(nid, zid);
> +	unsigned long flags;
> +	int drain, scan;
> +
> +	zone = &NODE_DATA(nid)->node_zones[zid];
> +	/* check one by one */
> +	scan = 0;
> +	drain = 0;
> +	spin_lock_irqsave(&zone->lru_lock, flags);
> +	while (!list_empty(lru) && (scan < SWAP_CLUSTER_MAX*2)) {
> +		scan++;
> +		pc = list_entry(lru->next, struct page_cgroup, lru);
> +		page = pc->page;
> +		/* Rotate */
> +		list_del(&pc->lru);
> +		list_add_tail(&pc->lru, lru);
> +		/* get page for isolate_lru_page() */
> +		if (get_page_unless_zero(page)) {
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			if (!isolate_lru_page(page)) {
> +				/* This page is not ON LRU */
> +				if (trylock_page(page)) {
> +					drain += try_to_free_swap(page);
> +					unlock_page(page);
> +				}
> +				putback_lru_page(page);
> +			}
> +			put_page(page);
> +			spin_lock_irqsave(&zone->lru_lock, flags);		
> +		}
> +	}
> +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +	return drain;
> +}
> +
> +static int last_visit;
> +void check_stale_swapcaches(void)
> +{
> +	int nid, zid, drain;
> +	
> +	nid = last_visit;
> +	drain = 0;
> +	
> +	if (atomic_read(&num_orphan_pages) < CHECK_ORPHAN_THRESH)
> +		return;
> +		
> +again:
> +	nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> +	if (nid == MAX_NUMNODES) {
> +		nid = 0;
> +		if (!node_state(nid, N_HIGH_MEMORY))
> +			goto again;
> +	}
> +	last_visit = nid;
> +
> +	for (zid = 0; !drain && zid < MAX_NR_ZONES; zid++)
> +		drain += drain_orphan_swapcaches(nid, zid);
> +}
> +
> +
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -842,6 +964,9 @@ static int mem_cgroup_hierarchical_recla
>  	int ret, total = 0;
>  	int loop = 0;
>  
> +	if (vm_swap_full())
> +		check_stale_swapcaches();
> +
hmm... vm_swap_full() would be enough from the kernel's point of view, but
users can see the "leak" if the swap size is big (I don't want to hear
"hey! something is leaking!").

Just removing vm_swap_full() is enough, isn't it ?
check_stale_swapcaches() already checks the threshold.


Thanks,
Daisuke Nishimura.

>  	while (loop < 2) {
>  		victim = mem_cgroup_select_victim(root_mem);
>  		if (victim == root_mem)
> @@ -2454,6 +2579,7 @@ mem_cgroup_create(struct cgroup_subsys *
>  	/* root ? */
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
> +		init_orphan_lru();
>  		parent = NULL;
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
> 


* Re: [RFC] memcg: handle swapcache leak
  2009-03-18  1:17         ` Daisuke Nishimura
@ 2009-03-18  1:34           ` KAMEZAWA Hiroyuki
  2009-03-18  3:51             ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-18  1:34 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

On Wed, 18 Mar 2009 10:17:27 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > ==
> > ---
> >  include/linux/page_cgroup.h |   13 ++++
> >  mm/memcontrol.c             |  140 +++++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 146 insertions(+), 7 deletions(-)
> > 
> > Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> > +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > @@ -26,6 +26,7 @@ enum {
> >  	PCG_LOCK,  /* page cgroup is locked */
> >  	PCG_CACHE, /* charged as cache */
> >  	PCG_USED, /* this object is in use. */
> > +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
> >  };
> >  
> >  #define TESTPCGFLAG(uname, lname)			\
> > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
> >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> >  
> > +#define TESTSETPCGFLAG(uname, lname) \
> > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > +        { return test_and_set_bit(PCG_##lname, &pc->flags);}
> > +
> > +#define TESTCLEARPCGFLAG(uname, lname) \
> > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > +        { return test_and_clear_bit(PCG_##lname, &pc->flags);}
> > +
> >  /* Cache flag is set only once (at allocation) */
> >  TESTPCGFLAG(Cache, CACHE)
> >  
> >  TESTPCGFLAG(Used, USED)
> >  CLEARPCGFLAG(Used, USED)
> >  
> > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > +
> > +
> >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> >  {
> >  	return page_to_nid(pc->page);
> > Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> > +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> > @@ -204,11 +204,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> >  };
> >  
> >  /* for encoding cft->private value on file */
> > -#define _MEM			(0)
> > -#define _MEMSWAP		(1)
> > -#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> > -#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
> > -#define MEMFILE_ATTR(val)	((val) & 0xffff)
> > +#define _MEM                   (0)
> > +#define _MEMSWAP               (1)
> > +#define MEMFILE_PRIVATE(x, val)        (((x) << 16) | (val))
> > +#define MEMFILE_TYPE(val)      (((val) >> 16) & 0xffff)
> > +#define MEMFILE_ATTR(val)      ((val) & 0xffff)
> > +
> > +/* for orphan page_cgroups, guarded by zone->lock. */
> > +struct orphan_pcg_list {
> > +	struct list_head zone[MAX_NR_ZONES];
> > +};
> > +struct orphan_pcg_list *orphan_list[MAX_NUMNODES];
> > +atomic_t num_orphan_pages;
> > +
> > +static inline struct list_head *orphan_lru(int nid, int zid)
> > +{
> > +	/*
> > +	 * to kick this BUG_ON(), swapcache must be generated while init.
> > +	 * or NID should be invalid.
> > +	 */
> > +	BUG_ON(!orphan_list[nid]);
> > +	return  &orphan_list[nid]->zone[zid];
> > +}
> > +
> >  
> >  static void mem_cgroup_get(struct mem_cgroup *mem);
> >  static void mem_cgroup_put(struct mem_cgroup *mem);
> > @@ -380,6 +398,14 @@ void mem_cgroup_del_lru_list(struct page
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  	pc = lookup_page_cgroup(page);
> > +	/*
> > +	 * If the page is SwapCache and already on global LRU, it will be on
> > +	 * orphan list. remove here
> > +	 */
> > +	if (unlikely(PageSwapCache(page) && TestClearPageCgroupOrphan(pc))) {
> > +		list_del_init(&pc->lru);
> > +		atomic_dec(&num_orphan_pages);
> > +	}
> >  	/* can happen while we handle swapcache. */
> >  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
> >  		return;
> Is this check "PageSwapCache(page)" needed ?
> 
Because TestClearPageCgroupOrphan() is atomic, I filter it with the PageSwapCache() check first.

> What happens, for example, if a swapcache which has been swaped-in by readahead
> and has not been mapped by the owner process is zapped by the process ?
It will never be zapped by a process. It'll stay on the LRU as a stale SwapCache page.
This orphan list is for such pages.

> IIUC, free_swap_and_cache() removes the page from swapcache before the page is
> removed from LRU.
> 
If the page is removed from the SwapCache, there is no problem of a swp_entry leak,
right ?

I'm sorry if I'm missing your point.


> > @@ -414,7 +440,7 @@ void mem_cgroup_rotate_lru_list(struct p
> >  	 */
> >  	smp_rmb();
> >  	/* unused page is not rotated. */
> > -	if (!PageCgroupUsed(pc))
> > +	if (unlikely(!PageCgroupUsed(pc)))
> >  		return;
> >  	mz = page_cgroup_zoneinfo(pc);
> >  	list_move(&pc->lru, &mz->lists[lru]);
> > @@ -433,8 +459,15 @@ void mem_cgroup_add_lru_list(struct page
> >  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> >  	 */
> >  	smp_rmb();
> > -	if (!PageCgroupUsed(pc))
> > +	if (unlikely(!PageCgroupUsed(pc))) {
> > +		if (PageSwapCache(page) && !TestSetPageCgroupOrphan(pc)) {
> > +			struct list_head *lru;
> > +			lru = orphan_lru(page_to_nid(page), page_zonenum(page));
> > +			list_add_tail(&pc->lru, lru);
> > +			atomic_inc(&num_orphan_pages);
> > +		}
> >  		return;
> > +	}
> >  
> >  	mz = page_cgroup_zoneinfo(pc);
> >  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > @@ -784,6 +817,95 @@ static int mem_cgroup_count_children(str
> >  	return num;
> >  }
> >  
> > +
> > +
> > +/* Using big number here for avoiding to free swap-cache of readahead. */
> > +#define CHECK_ORPHAN_THRESH  (4096)
> > +
> > +static __init void init_orphan_lru(void)
> > +{
> > +	struct orphan_pcg_list *opl;
> > +	int nid, zid;
> > +
> > +	for_each_node_state(nid, N_POSSIBLE) {
> > +		opl = kmalloc(sizeof(struct orphan_pcg_list),  GFP_KERNEL);
> > +		BUG_ON(!opl);
> > +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> > +			INIT_LIST_HEAD(&opl->zone[zid]);
> > +		orphan_list[nid] = opl;
> > +	}
> > +}
> > +/* 
> > + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> > + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> > + */
> > +static int drain_orphan_swapcaches(int nid, int zid)
> > +{
> > +	struct page_cgroup *pc;
> > +	struct zone *zone;
> > +	struct page *page;
> > +	struct list_head *lru = orphan_lru(nid, zid);
> > +	unsigned long flags;
> > +	int drain, scan;
> > +
> > +	zone = &NODE_DATA(nid)->node_zones[zid];
> > +	/* check one by one */
> > +	scan = 0;
> > +	drain = 0;
> > +	spin_lock_irqsave(&zone->lru_lock, flags);
> > +	while (!list_empty(lru) && (scan < SWAP_CLUSTER_MAX*2)) {
> > +		scan++;
> > +		pc = list_entry(lru->next, struct page_cgroup, lru);
> > +		page = pc->page;
> > +		/* Rotate */
> > +		list_del(&pc->lru);
> > +		list_add_tail(&pc->lru, lru);
> > +		/* get page for isolate_lru_page() */
> > +		if (get_page_unless_zero(page)) {
> > +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +			if (!isolate_lru_page(page)) {
> > +				/* This page is not ON LRU */
> > +				if (trylock_page(page)) {
> > +					drain += try_to_free_swap(page);
> > +					unlock_page(page);
> > +				}
> > +				putback_lru_page(page);
> > +			}
> > +			put_page(page);
> > +			spin_lock_irqsave(&zone->lru_lock, flags);		
> > +		}
> > +	}
> > +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +
> > +	return drain;
> > +}
> > +
> > +static int last_visit;
> > +void check_stale_swapcaches(void)
> > +{
> > +	int nid, zid, drain;
> > +	
> > +	nid = last_visit;
> > +	drain = 0;
> > +	
> > +	if (atomic_read(&num_orphan_pages) < CHECK_ORPHAN_THRESH)
> > +		return;
> > +		
> > +again:
> > +	nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> > +	if (nid == MAX_NUMNODES) {
> > +		nid = 0;
> > +		if (!node_state(nid, N_HIGH_MEMORY))
> > +			goto again;
> > +	}
> > +	last_visit = nid;
> > +
> > +	for (zid = 0; !drain && zid < MAX_NR_ZONES; zid++)
> > +		drain += drain_orphan_swapcaches(nid, zid);
> > +}
> > +
> > +
> > +
> >  /*
> >   * Visit the first child (need not be the first child as per the ordering
> >   * of the cgroup list, since we track last_scanned_child) of @mem and use
> > @@ -842,6 +964,9 @@ static int mem_cgroup_hierarchical_recla
> >  	int ret, total = 0;
> >  	int loop = 0;
> >  
> > +	if (vm_swap_full())
> > +		check_stale_swapcaches();
> > +
> hmm... vm_swap_full() would be enough from the kernel point of view, but
> users can see the "leak" if the swapsize is big(I don't want to here
> "hey! something is leaking!").
> 
> Just removing vm_swap_full() is enough, isn't it ?
> check_stale_swapcaches() checks the threshhold.
> 
How about
      if (vm_swap_full() || noswap)

?

Thanks,
-Kame




* Re: [RFC] memcg: handle swapcache leak
  2009-03-18  1:34           ` KAMEZAWA Hiroyuki
@ 2009-03-18  3:51             ` Daisuke Nishimura
  2009-03-18  4:05               ` KAMEZAWA Hiroyuki
  2009-03-18  8:57               ` [PATCH] fix unused/stale swap cache handling on memcg v1 (Re: " KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-18  3:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Wed, 18 Mar 2009 10:34:18 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 18 Mar 2009 10:17:27 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > ==
> > > ---
> > >  include/linux/page_cgroup.h |   13 ++++
> > >  mm/memcontrol.c             |  140 +++++++++++++++++++++++++++++++++++++++++---
> > >  2 files changed, 146 insertions(+), 7 deletions(-)
> > > 
> > > Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > > ===================================================================
> > > --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> > > +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > > @@ -26,6 +26,7 @@ enum {
> > >  	PCG_LOCK,  /* page cgroup is locked */
> > >  	PCG_CACHE, /* charged as cache */
> > >  	PCG_USED, /* this object is in use. */
> > > +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
> > >  };
> > >  
> > >  #define TESTPCGFLAG(uname, lname)			\
> > > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
> > >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> > >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> > >  
> > > +#define TESTSETPCGFLAG(uname, lname) \
> > > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > > +        { return test_and_set_bit(PCG_##lname, &pc->flags);}
> > > +
> > > +#define TESTCLEARPCGFLAG(uname, lname) \
> > > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > > +        { return test_and_clear_bit(PCG_##lname, &pc->flags);}
> > > +
> > >  /* Cache flag is set only once (at allocation) */
> > >  TESTPCGFLAG(Cache, CACHE)
> > >  
> > >  TESTPCGFLAG(Used, USED)
> > >  CLEARPCGFLAG(Used, USED)
> > >  
> > > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > > +
> > > +
> > >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> > >  {
> > >  	return page_to_nid(pc->page);
> > > Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> > > +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> > > @@ -204,11 +204,29 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> > >  };
> > >  
> > >  /* for encoding cft->private value on file */
> > > -#define _MEM			(0)
> > > -#define _MEMSWAP		(1)
> > > -#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
> > > -#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
> > > -#define MEMFILE_ATTR(val)	((val) & 0xffff)
> > > +#define _MEM                   (0)
> > > +#define _MEMSWAP               (1)
> > > +#define MEMFILE_PRIVATE(x, val)        (((x) << 16) | (val))
> > > +#define MEMFILE_TYPE(val)      (((val) >> 16) & 0xffff)
> > > +#define MEMFILE_ATTR(val)      ((val) & 0xffff)
> > > +
> > > +/* for orphan page_cgroups, guarded by zone->lock. */
> > > +struct orphan_pcg_list {
> > > +	struct list_head zone[MAX_NR_ZONES];
> > > +};
> > > +struct orphan_pcg_list *orphan_list[MAX_NUMNODES];
> > > +atomic_t num_orphan_pages;
> > > +
> > > +static inline struct list_head *orphan_lru(int nid, int zid)
> > > +{
> > > +	/*
> > > +	 * to kick this BUG_ON(), swapcache must be generated while init.
> > > +	 * or NID should be invalid.
> > > +	 */
> > > +	BUG_ON(!orphan_list[nid]);
> > > +	return  &orphan_list[nid]->zone[zid];
> > > +}
> > > +
> > >  
> > >  static void mem_cgroup_get(struct mem_cgroup *mem);
> > >  static void mem_cgroup_put(struct mem_cgroup *mem);
> > > @@ -380,6 +398,14 @@ void mem_cgroup_del_lru_list(struct page
> > >  	if (mem_cgroup_disabled())
> > >  		return;
> > >  	pc = lookup_page_cgroup(page);
> > > +	/*
> > > +	 * If the page is SwapCache and already on global LRU, it will be on
> > > +	 * orphan list. remove here
> > > +	 */
> > > +	if (unlikely(PageSwapCache(page) && TestClearPageCgroupOrphan(pc))) {
> > > +		list_del_init(&pc->lru);
> > > +		atomic_dec(&num_orphan_pages);
> > > +	}
> > >  	/* can happen while we handle swapcache. */
> > >  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
> > >  		return;
> > Is this check "PageSwapCache(page)" needed ?
> > 
> Because TestClearPageCgroupOrhpan() is atomic, I filter it by SwapCache() check.
> 
> > What happens, for example, if a swapcache which has been swaped-in by readahead
> > and has not been mapped by the owner process is zapped by the process ?
> Will never be zapped by process. It'll be on LRU as stale SwapCache().
> This orphan list is for such pages.
> 
Ah, I'm talking about the case of:
  zap_pte_range() -> (the pte holds a swp_entry whose page is in swapcache) free_swap_and_cache()

> > IIUC, free_swap_and_cache() removes the page from swapcache before the page is
> > removed from LRU.
> > 
> If the page is removed from SwapCache, there is no problem of swp_entry leak.
> right ?
> 
> I'm sorry if I miss your point.
> 
> 
Yes, there would be no swp_entry leak, but these pages, which have been
removed from the swapcache and are being freed by free_swap_and_cache(),
cannot be removed from the orphan_lru, although they are removed from the
global LRU, right ?

> > > @@ -414,7 +440,7 @@ void mem_cgroup_rotate_lru_list(struct p
> > >  	 */
> > >  	smp_rmb();
> > >  	/* unused page is not rotated. */
> > > -	if (!PageCgroupUsed(pc))
> > > +	if (unlikely(!PageCgroupUsed(pc)))
> > >  		return;
> > >  	mz = page_cgroup_zoneinfo(pc);
> > >  	list_move(&pc->lru, &mz->lists[lru]);
> > > @@ -433,8 +459,15 @@ void mem_cgroup_add_lru_list(struct page
> > >  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> > >  	 */
> > >  	smp_rmb();
> > > -	if (!PageCgroupUsed(pc))
> > > +	if (unlikely(!PageCgroupUsed(pc))) {
> > > +		if (PageSwapCache(page) && !TestSetPageCgroupOrphan(pc)) {
> > > +			struct list_head *lru;
> > > +			lru = orphan_lru(page_to_nid(page), page_zonenum(page));
> > > +			list_add_tail(&pc->lru, lru);
> > > +			atomic_inc(&num_orphan_pages);
> > > +		}
> > >  		return;
> > > +	}
> > >  
> > >  	mz = page_cgroup_zoneinfo(pc);
> > >  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > > @@ -784,6 +817,95 @@ static int mem_cgroup_count_children(str
> > >  	return num;
> > >  }
> > >  
> > > +
> > > +
> > > +/* Using big number here for avoiding to free swap-cache of readahead. */
> > > +#define CHECK_ORPHAN_THRESH  (4096)
> > > +
> > > +static __init void init_orphan_lru(void)
> > > +{
> > > +	struct orphan_pcg_list *opl;
> > > +	int nid, zid;
> > > +
> > > +	for_each_node_state(nid, N_POSSIBLE) {
> > > +		opl = kmalloc(sizeof(struct orphan_pcg_list),  GFP_KERNEL);
> > > +		BUG_ON(!opl);
> > > +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> > > +			INIT_LIST_HEAD(&opl->zone[zid]);
> > > +		orphan_list[nid] = opl;
> > > +	}
> > > +}
> > > +/* 
> > > + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> > > + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> > > + */
> > > +static int drain_orphan_swapcaches(int nid, int zid)
> > > +{
> > > +	struct page_cgroup *pc;
> > > +	struct zone *zone;
> > > +	struct page *page;
> > > +	struct list_head *lru = orphan_lru(nid, zid);
> > > +	unsigned long flags;
> > > +	int drain, scan;
> > > +
> > > +	zone = &NODE_DATA(nid)->node_zones[zid];
> > > +	/* check one by one */
> > > +	scan = 0;
> > > +	drain = 0;
> > > +	spin_lock_irqsave(&zone->lru_lock, flags);
> > > +	while (!list_empty(lru) && (scan < SWAP_CLUSTER_MAX*2)) {
> > > +		scan++;
> > > +		pc = list_entry(lru->next, struct page_cgroup, lru);
> > > +		page = pc->page;
> > > +		/* Rotate */
> > > +		list_del(&pc->lru);
> > > +		list_add_tail(&pc->lru, lru);
> > > +		/* get page for isolate_lru_page() */
> > > +		if (get_page_unless_zero(page)) {
> > > +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > +			if (!isolate_lru_page(page)) {
> > > +				/* This page is not ON LRU */
> > > +				if (trylock_page(page)) {
> > > +					drain += try_to_free_swap(page);
> > > +					unlock_page(page);
> > > +				}
> > > +				putback_lru_page(page);
> > > +			}
> > > +			put_page(page);
> > > +			spin_lock_irqsave(&zone->lru_lock, flags);		
> > > +		}
> > > +	}
> > > +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > +
> > > +	return drain;
> > > +}
> > > +
> > > +static int last_visit;
> > > +void check_stale_swapcaches(void)
> > > +{
> > > +	int nid, zid, drain;
> > > +	
> > > +	nid = last_visit;
> > > +	drain = 0;
> > > +	
> > > +	if (atomic_read(&num_orphan_pages) < CHECK_ORPHAN_THRESH)
> > > +		return;
> > > +		
> > > +again:
> > > +	nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> > > +	if (nid == MAX_NUMNODES) {
> > > +		nid = 0;
> > > +		if (!node_state(nid, N_HIGH_MEMORY))
> > > +			goto again;
> > > +	}
> > > +	last_visit = nid;
> > > +
> > > +	for (zid = 0; !drain && zid < MAX_NR_ZONES; zid++)
> > > +		drain += drain_orphan_swapcaches(nid, zid);
> > > +}
> > > +
> > > +
> > > +
> > >  /*
> > >   * Visit the first child (need not be the first child as per the ordering
> > >   * of the cgroup list, since we track last_scanned_child) of @mem and use
> > > @@ -842,6 +964,9 @@ static int mem_cgroup_hierarchical_recla
> > >  	int ret, total = 0;
> > >  	int loop = 0;
> > >  
> > > +	if (vm_swap_full())
> > > +		check_stale_swapcaches();
> > > +
> > hmm... vm_swap_full() would be enough from the kernel point of view, but
> > users can see the "leak" if the swapsize is big(I don't want to here
> > "hey! something is leaking!").
> > 
> > Just removing vm_swap_full() is enough, isn't it ?
> > check_stale_swapcaches() checks the threshhold.
> > 
> How about
>       if (vm_swap_full() || noswap)
> 
> ?
> 
It may work for the type-1 swapcaches that I described in my first mail,
because their memsw charges are not uncharged while they are on the swapcache.

But it doesn't work for the type-2 swapcaches, because they are uncharged
from both mem and memsw.

Hmm, should I send the patch for shrink_page_list() attached in my first mail
as a separate patch ?


Thanks,
Daisuke Nishimura.


* Re: [RFC] memcg: handle swapcache leak
  2009-03-18  3:51             ` Daisuke Nishimura
@ 2009-03-18  4:05               ` KAMEZAWA Hiroyuki
  2009-03-18  8:57               ` [PATCH] fix unused/stale swap cache handling on memcg v1 (Re: " KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-18  4:05 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

On Wed, 18 Mar 2009 12:51:54 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > I'm sorry if I miss your point.
> > 
> > 
> Yes, there would be no problem of swp_entry leak, but these pages, which have
> been removed from swapcache and are being free'ed by free_swap_and_cache,
> cannot be removed from orphan_lru, although they are removed from global LRU, right ?
> 
Ah, I see, thank you. The SwapCache flag is cleared before the page is deleted from the LRU.
OK, will fix.

> > 
> It may work for type-1 of swapcaches that I described in first mail,
> because memsw charges of them are not uncharged while they are on swapcache.
> 
> But it doesn't work for type-2 of swapcaches because they are uncharged
> from both mem and memsw.
> 
> Hmm, should I send a patch for shrink_page_list() attached in first mail
> as another patch ?
> 
OK, just check the number of pages on the orphan list.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] fix unused/stale swap cache handling on memcg  v1 (Re: [RFC] memcg: handle swapcache leak
  2009-03-18  3:51             ` Daisuke Nishimura
  2009-03-18  4:05               ` KAMEZAWA Hiroyuki
@ 2009-03-18  8:57               ` KAMEZAWA Hiroyuki
  2009-03-18 14:17                 ` Daisuke Nishimura
  1 sibling, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-18  8:57 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Hugh Dickins

How about this? I did a short test and this seems to work well.
I'd be glad if you would share your test method with us.
(Hopefully, update memcg_debug.txt ;)

Changes:
 - modified the condition that triggers reclaim of orphan pages.

If I get a good answer, I'll repost this with a CC: to Andrew.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Nishimura reported that unused swap caches are not reclaimed well under memcg.

Assume that a memory cgroup limits the memory usage of all applications
and file caches well, so the global LRU scan (kswapd etc.) never runs.

First, there is an *allowed* race that leaves SwapCache pages on the global LRU.
SwapCache pages can stay on the global LRU even when their swp_entry is no longer
referenced by anyone (no ptes). When a global LRU scan runs, they are reclaimed
by try_to_free_swap(). But they never appear on memcg's private LRU and are never
reclaimed by memcg's reclaim routines.

Second, there are readahead SwapCaches; some of them end up never being used and
are eventually reclaimed by the global LRU when a scan runs. But they are not on
memcg's private LRU and will not be reclaimed until a global LRU scan runs.

From memcg's point of view, the above two cases are not good. Especially, *unused*
swp_entries add pressure to memcg's mem+swap controller and can finally cause OOM.
(Nishimura confirmed this can cause OOM.)

This patch tries to reclaim unused swap caches by
  - adding a list for unused swap caches (orphan_list)
  - trying to reclaim the orphan list when it exceeds a threshold.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |   13 +++
 mm/memcontrol.c             |  185 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 197 insertions(+), 1 deletion(-)

Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
+
+#define TESTCLEARPCGFLAG(uname, lname) \
+static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTPCGFLAG(Orphan, ORPHAN)
+TESTSETPCGFLAG(Orphan, ORPHAN)
+TESTCLEARPCGFLAG(Orphan, ORPHAN)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar13/mm/memcontrol.c
@@ -371,6 +371,61 @@ static int mem_cgroup_walk_tree(struct m
  * When moving account, the page is not on LRU. It's isolated.
  */
 
+/*
+ * Orphan List is a list for page_cgroup which is not free but not under
+ * any cgroup. SwapCache which is prefetched by readahead() is typical type but
+ * there are other corner cases.
+ *
+ * Usually, updates to this list happens when swap cache is readaheaded and
+ * finally used by process.
+ */
+
+/* for orphan page_cgroups, updated under zone->lru_lock. */
+
+struct orphan_list_node {
+	struct orphan_list_zone {
+		int event;
+		struct list_head list;
+	} zone[MAX_NR_ZONES];
+};
+struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+#define ORPHAN_EVENT_THRESH (256)
+static void check_orphan_stat(void);
+atomic_t nr_orphan_caches;
+
+static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
+{
+	/*
+	 * 2 cases for this BUG_ON(), swapcache is generated while init.
+	 * or NID should be invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
+static inline void remove_orphan_list(struct page_cgroup *pc)
+{
+	if (TestClearPageCgroupOrphan(pc)) {
+		list_del_init(&pc->lru);
+		atomic_dec(&nr_orphan_caches);
+	}
+}
+
+static void add_orphan_list(struct page *page, struct page_cgroup *pc)
+{
+	if (TestSetPageCgroupOrphan(pc)) {
+		struct orphan_list_zone *opl;
+		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+		list_add_tail(&pc->lru, &opl->list);
+		atomic_inc(&nr_orphan_caches);
+		if (opl->event++ > ORPHAN_EVENT_THRESH) {
+			check_orphan_stat();
+			opl->event = 0;
+		}
+	}
+}
+
+
 void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
@@ -380,6 +435,14 @@ void mem_cgroup_del_lru_list(struct page
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on global LRU, it will be on
+	 * orphan list. remove here
+	 */
+	if (unlikely(PageCgroupOrphan(pc))) {
+		remove_orphan_list(pc);
+		return;
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -433,8 +496,11 @@ void mem_cgroup_add_lru_list(struct page
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
+	if (!PageCgroupUsed(pc)) {
+		/* handle swap cache here */
+		add_orphan_list(page, pc);
 		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -471,6 +537,9 @@ static void mem_cgroup_lru_add_after_com
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 
 	spin_lock_irqsave(&zone->lru_lock, flags);
+	if (PageCgroupOrphan(pc))
+		remove_orphan_list(pc);
+
 	/* link when the page is linked to LRU but page_cgroup isn't */
 	if (PageLRU(page) && list_empty(&pc->lru))
 		mem_cgroup_add_lru_list(page, page_lru(page));
@@ -784,6 +853,119 @@ static int mem_cgroup_count_children(str
 	return num;
 }
 
+
+
+/*
+ * Using big number here for avoiding to free orphan swap-cache by readahead
+ * We don't want to delete swap caches read by readahead.
+ */
+static int orphan_thresh(void)
+{
+	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
+	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
+
+	nr_pages *= nr_threads/2; /* nr_threads can be too big, too small */
+
+	/* too small value will kill readahead */
+	if (nr_pages < base)
+		return base;
+
+	/* too big is not suitable here */
+	if (nr_pages > base * 4)
+		return base * 4;
+
+	return nr_pages;
+}
+
+/*
+ * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
+ * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
+ */
+static int drain_orphan_swapcaches(int nid, int zid)
+{
+	struct page_cgroup *pc;
+	struct zone *zone;
+	struct page *page;
+	struct orphan_list_zone *lru = orphan_lru(nid, zid);
+	unsigned long flags;
+	int drain, scan;
+
+	zone = &NODE_DATA(nid)->node_zones[zid];
+	/* check one by one */
+	scan = 0;
+	drain = 0;
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	while (!list_empty(&lru->list) && (scan < SWAP_CLUSTER_MAX*2)) {
+		scan++;
+		pc = list_entry(lru->list.next, struct page_cgroup, lru);
+		page = pc->page;
+		/* Rotate */
+		list_del(&pc->lru);
+		list_add_tail(&pc->lru, &lru->list);
+		/* get page for isolate_lru_page() */
+		if (get_page_unless_zero(page)) {
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			if (!isolate_lru_page(page)) {
+				/* Now, This page is removed from LRU */
+				if (trylock_page(page)) {
+					drain += try_to_free_swap(page);
+					unlock_page(page);
+				}
+				putback_lru_page(page);
+			}
+			put_page(page);
+			spin_lock_irqsave(&zone->lru_lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+	return drain;
+}
+
+/* access without lock...serialization is not so important here. */
+static int last_visit;
+void try_delete_orphan_caches(struct work_struct *work)
+{
+	int nid, zid, drain;
+
+	nid = last_visit;
+	drain = 0;
+	while (drain < SWAP_CLUSTER_MAX) {
+		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
+		if (nid == MAX_NUMNODES)
+			nid = 0;
+		last_visit = nid;
+		if (node_state(nid, N_HIGH_MEMORY))
+			for (zid = 0; zid < MAX_NR_ZONES; zid++)
+				drain += drain_orphan_swapcaches(nid, zid);
+		if (nid == 0)
+			break;
+	}
+}
+DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
+
+static void check_orphan_stat(void)
+{
+	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
+		schedule_work(&orphan_delete_work);
+}
+
+static __init void init_orphan_lru(void)
+{
+	struct orphan_list_node *opl;
+	int nid, zid;
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		opl = kmalloc(sizeof(struct orphan_list_node),  GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+			INIT_LIST_HEAD(&opl->zone[zid].list);
+			opl->zone[zid].event = 0;
+		}
+		orphan_list[nid] = opl;
+	}
+}
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -2454,6 +2636,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v1 (Re: [RFC] memcg: handle swapcache leak
  2009-03-18  8:57               ` [PATCH] fix unused/stale swap cache handling on memcg v1 (Re: " KAMEZAWA Hiroyuki
@ 2009-03-18 14:17                 ` Daisuke Nishimura
  2009-03-18 23:45                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-18 14:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Balbir Singh, Hugh Dickins, d-nishimura, Daisuke Nishimura

On Wed, 18 Mar 2009 17:57:34 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> How about this ? I did short test and this eems to work well.
> I'm glad if you share us your test method.
> (Hopefully, update memcg_debug.txt ;)
> 
s/memcg_debug/memcg_test/ ?

I don't do anything special that isn't in the document.
I just run one of them (or a combination) for a long time, repeatedly,
and observe what happens.
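
If it helps, here is a generic swap-pressure load one can run inside a memcg to
reproduce this kind of leak (an illustrative user-space sketch, not one of the
memcg_test.txt cases; pick SIZE larger than memory.limit_in_bytes so the task
keeps faulting pages in and out of swap):

	#include <stdio.h>
	#include <stdlib.h>

	#define SIZE	(256UL << 20)	/* 256MB working set */
	#define PAGE	4096UL

	int main(void)
	{
		unsigned char *buf = malloc(SIZE);
		unsigned long i, pass;

		if (!buf) {
			perror("malloc");
			return 1;
		}
		for (pass = 0; ; pass++) {
			/* touch every page so it must be faulted back in */
			for (i = 0; i < SIZE; i += PAGE)
				buf[i] = (unsigned char)pass;
			if (!(pass % 10))
				fprintf(stderr, "pass %lu\n", pass);
		}
		return 0;
	}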

> Changes:
>  - modified condition to trigger reclaim orphan paes.
> 
> If I get good answer, I'll repost this with CC: to Andrew.
> 
It looks good to me.

Unfortunately, I don't have enough time tomorrow.
I'll test this during this weekend and report the result in next week.


Thanks,
Daisuke Nishimura.

> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Nishimura reported unused-swap-cache is not reclaimed well under memcg.
> 
> Assume that memory cgroup well limits the memory usage of all applications
> and file caches, and global-LRU-scan (kswapd() etc..) never runs.
> 
> First, there is *allowed* race to SwapCache on global LRU. There can be
> SwapCaches on global LRU, even when swp_entry is not referred by anyone(ptes).
> When global LRU scan runs, it will be reclaimed by try_to_free_swap().
> But, they will not appear in memcg's private LRU and never reclaimed by
> memcg's reclaim routines.
> 
> Second, there are readahead SwapCaches, some of then tend to be not used
> and reclaimed by global LRU when scan runs, at last. But they are not on
> memcg's private LRU and will not be reclaimed until global-lru-scan runs.
> 
> From memcg's point of view, above 2 is not very good. Especially, *unused*
> swp_entry adds pressure to memcg's mem+swap controller and finally cause OOM.
> (Nishimura confirmed this can cause OOM.)
> 
> This patch tries to reclaim unused-swapcache by 
>   - add a list for unused-swapcache (orphan_list)
>   - try to recalim orhan list by some threshold.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/page_cgroup.h |   13 +++
>  mm/memcontrol.c             |  185 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 197 insertions(+), 1 deletion(-)
> 
> Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
>  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname) \
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> +
> +#define TESTCLEARPCGFLAG(uname, lname) \
> +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTPCGFLAG(Orphan, ORPHAN)
> +TESTSETPCGFLAG(Orphan, ORPHAN)
> +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> @@ -371,6 +371,61 @@ static int mem_cgroup_walk_tree(struct m
>   * When moving account, the page is not on LRU. It's isolated.
>   */
>  
> +/*
> + * Orphan List is a list for page_cgroup which is not free but not under
> + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> + * there are other corner cases.
> + *
> + * Usually, updates to this list happens when swap cache is readaheaded and
> + * finally used by process.
> + */
> +
> +/* for orphan page_cgroups, updated under zone->lru_lock. */
> +
> +struct orphan_list_node {
> +	struct orphan_list_zone {
> +		int event;
> +		struct list_head list;
> +	} zone[MAX_NR_ZONES];
> +};
> +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +#define ORPHAN_EVENT_THRESH (256)
> +static void check_orphan_stat(void);
> +atomic_t nr_orphan_caches;
> +
> +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
> +static inline void remove_orphan_list(struct page_cgroup *pc)
> +{
> +	if (TestClearPageCgroupOrphan(pc)) {
> +		list_del_init(&pc->lru);
> +		atomic_dec(&nr_orphan_caches);
> +	}
> +}
> +
> +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> +{
> +	if (TestSetPageCgroupOrphan(pc)) {
> +		struct orphan_list_zone *opl;
> +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> +		list_add_tail(&pc->lru, &opl->list);
> +		atomic_inc(&nr_orphan_caches);
> +		if (opl->event++ > ORPHAN_EVENT_THRESH) {
> +			check_orphan_stat();
> +			opl->event = 0;
> +		}
> +	}
> +}
> +
> +
>  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  {
>  	struct page_cgroup *pc;
> @@ -380,6 +435,14 @@ void mem_cgroup_del_lru_list(struct page
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageCgroupOrphan(pc))) {
> +		remove_orphan_list(pc);
> +		return;
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
> @@ -433,8 +496,11 @@ void mem_cgroup_add_lru_list(struct page
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> +	if (!PageCgroupUsed(pc)) {
> +		/* handle swap cache here */
> +		add_orphan_list(page, pc);
>  		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> @@ -471,6 +537,9 @@ static void mem_cgroup_lru_add_after_com
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>  
>  	spin_lock_irqsave(&zone->lru_lock, flags);
> +	if (PageCgroupOrphan(pc))
> +		remove_orphan_list(pc);
> +
>  	/* link when the page is linked to LRU but page_cgroup isn't */
>  	if (PageLRU(page) && list_empty(&pc->lru))
>  		mem_cgroup_add_lru_list(page, page_lru(page));
> @@ -784,6 +853,119 @@ static int mem_cgroup_count_children(str
>  	return num;
>  }
>  
> +
> +
> +/*
> + * Using big number here for avoiding to free orphan swap-cache by readahead
> + * We don't want to delete swap caches read by readahead.
> + */
> +static int orphan_thresh(void)
> +{
> +	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
> +	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
> +
> +	nr_pages *= nr_threads/2; /* nr_threads can be too big, too small */
> +
> +	/* too small value will kill readahead */
> +	if (nr_pages < base)
> +		return base;
> +
> +	/* too big is not suitable here */
> +	if (nr_pages > base * 4)
> +		return base * 4;
> +
> +	return nr_pages;
> +}
> +
> +/*
> + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> + */
> +static int drain_orphan_swapcaches(int nid, int zid)
> +{
> +	struct page_cgroup *pc;
> +	struct zone *zone;
> +	struct page *page;
> +	struct orphan_list_zone *lru = orphan_lru(nid, zid);
> +	unsigned long flags;
> +	int drain, scan;
> +
> +	zone = &NODE_DATA(nid)->node_zones[zid];
> +	/* check one by one */
> +	scan = 0;
> +	drain = 0;
> +	spin_lock_irqsave(&zone->lru_lock, flags);
> +	while (!list_empty(&lru->list) && (scan < SWAP_CLUSTER_MAX*2)) {
> +		scan++;
> +		pc = list_entry(lru->list.next, struct page_cgroup, lru);
> +		page = pc->page;
> +		/* Rotate */
> +		list_del(&pc->lru);
> +		list_add_tail(&pc->lru, &lru->list);
> +		/* get page for isolate_lru_page() */
> +		if (get_page_unless_zero(page)) {
> +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> +			if (!isolate_lru_page(page)) {
> +				/* Now, This page is removed from LRU */
> +				if (trylock_page(page)) {
> +					drain += try_to_free_swap(page);
> +					unlock_page(page);
> +				}
> +				putback_lru_page(page);
> +			}
> +			put_page(page);
> +			spin_lock_irqsave(&zone->lru_lock, flags);
> +		}
> +	}
> +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +	return drain;
> +}
> +
> +/* access without lock...serialization is not so important here. */
> +static int last_visit;
> +void try_delete_orphan_caches(struct work_struct *work)
> +{
> +	int nid, zid, drain;
> +
> +	nid = last_visit;
> +	drain = 0;
> +	while (drain < SWAP_CLUSTER_MAX) {
> +		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> +		if (nid == MAX_NUMNODES)
> +			nid = 0;
> +		last_visit = nid;
> +		if (node_state(nid, N_HIGH_MEMORY))
> +			for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +				drain += drain_orphan_swapcaches(nid, zid);
> +		if (nid == 0)
> +			break;
> +	}
> +}
> +DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
> +
> +static void check_orphan_stat(void)
> +{
> +	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
> +		schedule_work(&orphan_delete_work);
> +}
> +
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_list_node *opl;
> +	int nid, zid;
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		opl = kmalloc(sizeof(struct orphan_list_node),  GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +			INIT_LIST_HEAD(&opl->zone[zid].list);
> +			opl->zone[zid].event = 0;
> +		}
> +		orphan_list[nid] = opl;
> +	}
> +}
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -2454,6 +2636,7 @@ mem_cgroup_create(struct cgroup_subsys *
>  	/* root ? */
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
> +		init_orphan_lru();
>  		parent = NULL;
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v1 (Re: [RFC] memcg: handle swapcache leak
  2009-03-18 14:17                 ` Daisuke Nishimura
@ 2009-03-18 23:45                   ` KAMEZAWA Hiroyuki
  2009-03-19  2:16                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-18 23:45 UTC (permalink / raw)
  To: nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Wed, 18 Mar 2009 23:17:38 +0900
Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> wrote:

> On Wed, 18 Mar 2009 17:57:34 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > How about this ? I did short test and this eems to work well.
> > I'm glad if you share us your test method.
> > (Hopefully, update memcg_debug.txt ;)
> > 
> s/memcg_debug/memcg_test/ ?
> 
> I don't do anything special not in the document.
> I just do one (or combination) of them for a long time and repeatedly,
> and observe what happens.
> 
Hmm, ok.

> > Changes:
> >  - modified condition to trigger reclaim orphan paes.
> > 
> > If I get good answer, I'll repost this with CC: to Andrew.
> > 
> It looks good to me.
> 
> Unfortunately, I don't have enough time tomorrow.
> I'll test this during this weekend and report the result in next week.
> 
Okay, thank you very much. I'll review this again.

Thanks,
-Kame


> 
> Thanks,
> Daisuke Nishimura.
> 
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Nishimura reported unused-swap-cache is not reclaimed well under memcg.
> > 
> > Assume that memory cgroup well limits the memory usage of all applications
> > and file caches, and global-LRU-scan (kswapd() etc..) never runs.
> > 
> > First, there is *allowed* race to SwapCache on global LRU. There can be
> > SwapCaches on global LRU, even when swp_entry is not referred by anyone(ptes).
> > When global LRU scan runs, it will be reclaimed by try_to_free_swap().
> > But, they will not appear in memcg's private LRU and never reclaimed by
> > memcg's reclaim routines.
> > 
> > Second, there are readahead SwapCaches, some of then tend to be not used
> > and reclaimed by global LRU when scan runs, at last. But they are not on
> > memcg's private LRU and will not be reclaimed until global-lru-scan runs.
> > 
> > From memcg's point of view, above 2 is not very good. Especially, *unused*
> > swp_entry adds pressure to memcg's mem+swap controller and finally cause OOM.
> > (Nishimura confirmed this can cause OOM.)
> > 
> > This patch tries to reclaim unused-swapcache by 
> >   - add a list for unused-swapcache (orphan_list)
> >   - try to recalim orhan list by some threshold.
> > 
> > Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/page_cgroup.h |   13 +++
> >  mm/memcontrol.c             |  185 +++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 197 insertions(+), 1 deletion(-)
> > 
> > Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> > +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > @@ -26,6 +26,7 @@ enum {
> >  	PCG_LOCK,  /* page cgroup is locked */
> >  	PCG_CACHE, /* charged as cache */
> >  	PCG_USED, /* this object is in use. */
> > +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
> >  };
> >  
> >  #define TESTPCGFLAG(uname, lname)			\
> > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
> >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> >  
> > +#define TESTSETPCGFLAG(uname, lname) \
> > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> > +
> > +#define TESTCLEARPCGFLAG(uname, lname) \
> > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> > +
> >  /* Cache flag is set only once (at allocation) */
> >  TESTPCGFLAG(Cache, CACHE)
> >  
> >  TESTPCGFLAG(Used, USED)
> >  CLEARPCGFLAG(Used, USED)
> >  
> > +TESTPCGFLAG(Orphan, ORPHAN)
> > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > +
> >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> >  {
> >  	return page_to_nid(pc->page);
> > Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> > +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> > @@ -371,6 +371,61 @@ static int mem_cgroup_walk_tree(struct m
> >   * When moving account, the page is not on LRU. It's isolated.
> >   */
> >  
> > +/*
> > + * Orphan List is a list for page_cgroup which is not free but not under
> > + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> > + * there are other corner cases.
> > + *
> > + * Usually, updates to this list happens when swap cache is readaheaded and
> > + * finally used by process.
> > + */
> > +
> > +/* for orphan page_cgroups, updated under zone->lru_lock. */
> > +
> > +struct orphan_list_node {
> > +	struct orphan_list_zone {
> > +		int event;
> > +		struct list_head list;
> > +	} zone[MAX_NR_ZONES];
> > +};
> > +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> > +#define ORPHAN_EVENT_THRESH (256)
> > +static void check_orphan_stat(void);
> > +atomic_t nr_orphan_caches;
> > +
> > +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> > +{
> > +	/*
> > +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> > +	 * or NID should be invalid.
> > +	 */
> > +	BUG_ON(!orphan_list[nid]);
> > +	return  &orphan_list[nid]->zone[zid];
> > +}
> > +
> > +static inline void remove_orphan_list(struct page_cgroup *pc)
> > +{
> > +	if (TestClearPageCgroupOrphan(pc)) {
> > +		list_del_init(&pc->lru);
> > +		atomic_dec(&nr_orphan_caches);
> > +	}
> > +}
> > +
> > +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> > +{
> > +	if (TestSetPageCgroupOrphan(pc)) {
> > +		struct orphan_list_zone *opl;
> > +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> > +		list_add_tail(&pc->lru, &opl->list);
> > +		atomic_inc(&nr_orphan_caches);
> > +		if (opl->event++ > ORPHAN_EVENT_THRESH) {
> > +			check_orphan_stat();
> > +			opl->event = 0;
> > +		}
> > +	}
> > +}
> > +
> > +
> >  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
> >  {
> >  	struct page_cgroup *pc;
> > @@ -380,6 +435,14 @@ void mem_cgroup_del_lru_list(struct page
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  	pc = lookup_page_cgroup(page);
> > +	/*
> > +	 * If the page is SwapCache and already on global LRU, it will be on
> > +	 * orphan list. remove here
> > +	 */
> > +	if (unlikely(PageCgroupOrphan(pc))) {
> > +		remove_orphan_list(pc);
> > +		return;
> > +	}
> >  	/* can happen while we handle swapcache. */
> >  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
> >  		return;
> > @@ -433,8 +496,11 @@ void mem_cgroup_add_lru_list(struct page
> >  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> >  	 */
> >  	smp_rmb();
> > -	if (!PageCgroupUsed(pc))
> > +	if (!PageCgroupUsed(pc)) {
> > +		/* handle swap cache here */
> > +		add_orphan_list(page, pc);
> >  		return;
> > +	}
> >  
> >  	mz = page_cgroup_zoneinfo(pc);
> >  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > @@ -471,6 +537,9 @@ static void mem_cgroup_lru_add_after_com
> >  	struct page_cgroup *pc = lookup_page_cgroup(page);
> >  
> >  	spin_lock_irqsave(&zone->lru_lock, flags);
> > +	if (PageCgroupOrphan(pc))
> > +		remove_orphan_list(pc);
> > +
> >  	/* link when the page is linked to LRU but page_cgroup isn't */
> >  	if (PageLRU(page) && list_empty(&pc->lru))
> >  		mem_cgroup_add_lru_list(page, page_lru(page));
> > @@ -784,6 +853,119 @@ static int mem_cgroup_count_children(str
> >  	return num;
> >  }
> >  
> > +
> > +
> > +/*
> > + * Using big number here for avoiding to free orphan swap-cache by readahead
> > + * We don't want to delete swap caches read by readahead.
> > + */
> > +static int orphan_thresh(void)
> > +{
> > +	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
> > +	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
> > +
> > +	nr_pages *= nr_threads/2; /* nr_threads can be too big, too small */
> > +
> > +	/* too small value will kill readahead */
> > +	if (nr_pages < base)
> > +		return base;
> > +
> > +	/* too big is not suitable here */
> > +	if (nr_pages > base * 4)
> > +		return base * 4;
> > +
> > +	return nr_pages;
> > +}
> > +
> > +/*
> > + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> > + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> > + */
> > +static int drain_orphan_swapcaches(int nid, int zid)
> > +{
> > +	struct page_cgroup *pc;
> > +	struct zone *zone;
> > +	struct page *page;
> > +	struct orphan_list_zone *lru = orphan_lru(nid, zid);
> > +	unsigned long flags;
> > +	int drain, scan;
> > +
> > +	zone = &NODE_DATA(nid)->node_zones[zid];
> > +	/* check one by one */
> > +	scan = 0;
> > +	drain = 0;
> > +	spin_lock_irqsave(&zone->lru_lock, flags);
> > +	while (!list_empty(&lru->list) && (scan < SWAP_CLUSTER_MAX*2)) {
> > +		scan++;
> > +		pc = list_entry(lru->list.next, struct page_cgroup, lru);
> > +		page = pc->page;
> > +		/* Rotate */
> > +		list_del(&pc->lru);
> > +		list_add_tail(&pc->lru, &lru->list);
> > +		/* get page for isolate_lru_page() */
> > +		if (get_page_unless_zero(page)) {
> > +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +			if (!isolate_lru_page(page)) {
> > +				/* Now, This page is removed from LRU */
> > +				if (trylock_page(page)) {
> > +					drain += try_to_free_swap(page);
> > +					unlock_page(page);
> > +				}
> > +				putback_lru_page(page);
> > +			}
> > +			put_page(page);
> > +			spin_lock_irqsave(&zone->lru_lock, flags);
> > +		}
> > +	}
> > +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +
> > +	return drain;
> > +}
> > +
> > +/* access without lock...serialization is not so important here. */
> > +static int last_visit;
> > +void try_delete_orphan_caches(struct work_struct *work)
> > +{
> > +	int nid, zid, drain;
> > +
> > +	nid = last_visit;
> > +	drain = 0;
> > +	while (drain < SWAP_CLUSTER_MAX) {
> > +		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> > +		if (nid == MAX_NUMNODES)
> > +			nid = 0;
> > +		last_visit = nid;
> > +		if (node_state(nid, N_HIGH_MEMORY))
> > +			for (zid = 0; zid < MAX_NR_ZONES; zid++)
> > +				drain += drain_orphan_swapcaches(nid, zid);
> > +		if (nid == 0)
> > +			break;
> > +	}
> > +}
> > +DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
> > +
> > +static void check_orphan_stat(void)
> > +{
> > +	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
> > +		schedule_work(&orphan_delete_work);
> > +}
> > +
> > +static __init void init_orphan_lru(void)
> > +{
> > +	struct orphan_list_node *opl;
> > +	int nid, zid;
> > +
> > +	for_each_node_state(nid, N_POSSIBLE) {
> > +		opl = kmalloc(sizeof(struct orphan_list_node),  GFP_KERNEL);
> > +		BUG_ON(!opl);
> > +		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> > +			INIT_LIST_HEAD(&opl->zone[zid].list);
> > +			opl->zone[zid].event = 0;
> > +		}
> > +		orphan_list[nid] = opl;
> > +	}
> > +}
> > +
> >  /*
> >   * Visit the first child (need not be the first child as per the ordering
> >   * of the cgroup list, since we track last_scanned_child) of @mem and use
> > @@ -2454,6 +2636,7 @@ mem_cgroup_create(struct cgroup_subsys *
> >  	/* root ? */
> >  	if (cont->parent == NULL) {
> >  		enable_swap_cgroup();
> > +		init_orphan_lru();
> >  		parent = NULL;
> >  	} else {
> >  		parent = mem_cgroup_from_cont(cont->parent);
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > 
> 
> 
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v1 (Re: [RFC] memcg: handle swapcache leak
  2009-03-18 23:45                   ` KAMEZAWA Hiroyuki
@ 2009-03-19  2:16                     ` KAMEZAWA Hiroyuki
  2009-03-19  9:06                       ` [PATCH] fix unused/stale swap cache handling on memcg v2 KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-19  2:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Thu, 19 Mar 2009 08:45:23 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > > Changes:
> > >  - modified condition to trigger reclaim orphan paes.
> > > 
> > > If I get good answer, I'll repost this with CC: to Andrew.
> > > 
> > It looks good to me.
> > 
> > Unfortunately, I don't have enough time tomorrow.
> > I'll test this during this weekend and report the result in next week.
> > 
> Okay, thank you very much. I'll review this again.
> 
I noticed a bug... I will post v2.

Thanks,
-Kame

> Thanks,
> -Kame
> 
> 
> > 
> > Thanks,
> > Daisuke Nishimura.
> > 
> > > ==
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > Nishimura reported unused-swap-cache is not reclaimed well under memcg.
> > > 
> > > Assume that memory cgroup well limits the memory usage of all applications
> > > and file caches, and global-LRU-scan (kswapd() etc..) never runs.
> > > 
> > > First, there is *allowed* race to SwapCache on global LRU. There can be
> > > SwapCaches on global LRU, even when swp_entry is not referred by anyone(ptes).
> > > When global LRU scan runs, it will be reclaimed by try_to_free_swap().
> > > But, they will not appear in memcg's private LRU and never reclaimed by
> > > memcg's reclaim routines.
> > > 
> > > Second, there are readahead SwapCaches, some of then tend to be not used
> > > and reclaimed by global LRU when scan runs, at last. But they are not on
> > > memcg's private LRU and will not be reclaimed until global-lru-scan runs.
> > > 
> > > From memcg's point of view, above 2 is not very good. Especially, *unused*
> > > swp_entry adds pressure to memcg's mem+swap controller and finally cause OOM.
> > > (Nishimura confirmed this can cause OOM.)
> > > 
> > > This patch tries to reclaim unused-swapcache by 
> > >   - add a list for unused-swapcache (orphan_list)
> > >   - try to recalim orhan list by some threshold.
> > > 
> > > Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  include/linux/page_cgroup.h |   13 +++
> > >  mm/memcontrol.c             |  185 +++++++++++++++++++++++++++++++++++++++++++-
> > >  2 files changed, 197 insertions(+), 1 deletion(-)
> > > 
> > > Index: mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > > ===================================================================
> > > --- mmotm-2.6.29-Mar13.orig/include/linux/page_cgroup.h
> > > +++ mmotm-2.6.29-Mar13/include/linux/page_cgroup.h
> > > @@ -26,6 +26,7 @@ enum {
> > >  	PCG_LOCK,  /* page cgroup is locked */
> > >  	PCG_CACHE, /* charged as cache */
> > >  	PCG_USED, /* this object is in use. */
> > > +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
> > >  };
> > >  
> > >  #define TESTPCGFLAG(uname, lname)			\
> > > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
> > >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> > >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> > >  
> > > +#define TESTSETPCGFLAG(uname, lname) \
> > > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > > +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> > > +
> > > +#define TESTCLEARPCGFLAG(uname, lname) \
> > > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > > +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> > > +
> > >  /* Cache flag is set only once (at allocation) */
> > >  TESTPCGFLAG(Cache, CACHE)
> > >  
> > >  TESTPCGFLAG(Used, USED)
> > >  CLEARPCGFLAG(Used, USED)
> > >  
> > > +TESTPCGFLAG(Orphan, ORPHAN)
> > > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > > +
> > >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> > >  {
> > >  	return page_to_nid(pc->page);
> > > Index: mmotm-2.6.29-Mar13/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-2.6.29-Mar13.orig/mm/memcontrol.c
> > > +++ mmotm-2.6.29-Mar13/mm/memcontrol.c
> > > @@ -371,6 +371,61 @@ static int mem_cgroup_walk_tree(struct m
> > >   * When moving account, the page is not on LRU. It's isolated.
> > >   */
> > >  
> > > +/*
> > > + * Orphan List is a list for page_cgroup which is not free but not under
> > > + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> > > + * there are other corner cases.
> > > + *
> > > + * Usually, updates to this list happens when swap cache is readaheaded and
> > > + * finally used by process.
> > > + */
> > > +
> > > +/* for orphan page_cgroups, updated under zone->lru_lock. */
> > > +
> > > +struct orphan_list_node {
> > > +	struct orphan_list_zone {
> > > +		int event;
> > > +		struct list_head list;
> > > +	} zone[MAX_NR_ZONES];
> > > +};
> > > +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> > > +#define ORPHAN_EVENT_THRESH (256)
> > > +static void check_orphan_stat(void);
> > > +atomic_t nr_orphan_caches;
> > > +
> > > +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> > > +{
> > > +	/*
> > > +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> > > +	 * or NID should be invalid.
> > > +	 */
> > > +	BUG_ON(!orphan_list[nid]);
> > > +	return  &orphan_list[nid]->zone[zid];
> > > +}
> > > +
> > > +static inline void remove_orphan_list(struct page_cgroup *pc)
> > > +{
> > > +	if (TestClearPageCgroupOrphan(pc)) {
> > > +		list_del_init(&pc->lru);
> > > +		atomic_dec(&nr_orphan_caches);
> > > +	}
> > > +}
> > > +
> > > +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> > > +{
> > > +	if (TestSetPageCgroupOrphan(pc)) {
> > > +		struct orphan_list_zone *opl;
> > > +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> > > +		list_add_tail(&pc->lru, &opl->list);
> > > +		atomic_inc(&nr_orphan_caches);
> > > +		if (opl->event++ > ORPHAN_EVENT_THRESH) {
> > > +			check_orphan_stat();
> > > +			opl->event = 0;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +
> > >  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
> > >  {
> > >  	struct page_cgroup *pc;
> > > @@ -380,6 +435,14 @@ void mem_cgroup_del_lru_list(struct page
> > >  	if (mem_cgroup_disabled())
> > >  		return;
> > >  	pc = lookup_page_cgroup(page);
> > > +	/*
> > > +	 * If the page is SwapCache and already on global LRU, it will be on
> > > +	 * orphan list. remove here
> > > +	 */
> > > +	if (unlikely(PageCgroupOrphan(pc))) {
> > > +		remove_orphan_list(pc);
> > > +		return;
> > > +	}
> > >  	/* can happen while we handle swapcache. */
> > >  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
> > >  		return;
> > > @@ -433,8 +496,11 @@ void mem_cgroup_add_lru_list(struct page
> > >  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> > >  	 */
> > >  	smp_rmb();
> > > -	if (!PageCgroupUsed(pc))
> > > +	if (!PageCgroupUsed(pc)) {
> > > +		/* handle swap cache here */
> > > +		add_orphan_list(page, pc);
> > >  		return;
> > > +	}
> > >  
> > >  	mz = page_cgroup_zoneinfo(pc);
> > >  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > > @@ -471,6 +537,9 @@ static void mem_cgroup_lru_add_after_com
> > >  	struct page_cgroup *pc = lookup_page_cgroup(page);
> > >  
> > >  	spin_lock_irqsave(&zone->lru_lock, flags);
> > > +	if (PageCgroupOrphan(pc))
> > > +		remove_orphan_list(pc);
> > > +
> > >  	/* link when the page is linked to LRU but page_cgroup isn't */
> > >  	if (PageLRU(page) && list_empty(&pc->lru))
> > >  		mem_cgroup_add_lru_list(page, page_lru(page));
> > > @@ -784,6 +853,119 @@ static int mem_cgroup_count_children(str
> > >  	return num;
> > >  }
> > >  
> > > +
> > > +
> > > +/*
> > > + * Using big number here for avoiding to free orphan swap-cache by readahead
> > > + * We don't want to delete swap caches read by readahead.
> > > + */
> > > +static int orphan_thresh(void)
> > > +{
> > > +	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
> > > +	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
> > > +
> > > +	nr_pages *= nr_threads/2; /* nr_threads can be too big, too small */
> > > +
> > > +	/* too small value will kill readahead */
> > > +	if (nr_pages < base)
> > > +		return base;
> > > +
> > > +	/* too big is not suitable here */
> > > +	if (nr_pages > base * 4)
> > > +		return base * 4;
> > > +
> > > +	return nr_pages;
> > > +}
> > > +
> > > +/*
> > > + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> > > + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> > > + */
> > > +static int drain_orphan_swapcaches(int nid, int zid)
> > > +{
> > > +	struct page_cgroup *pc;
> > > +	struct zone *zone;
> > > +	struct page *page;
> > > +	struct orphan_list_zone *lru = orphan_lru(nid, zid);
> > > +	unsigned long flags;
> > > +	int drain, scan;
> > > +
> > > +	zone = &NODE_DATA(nid)->node_zones[zid];
> > > +	/* check one by one */
> > > +	scan = 0;
> > > +	drain = 0;
> > > +	spin_lock_irqsave(&zone->lru_lock, flags);
> > > +	while (!list_empty(&lru->list) && (scan < SWAP_CLUSTER_MAX*2)) {
> > > +		scan++;
> > > +		pc = list_entry(lru->list.next, struct page_cgroup, lru);
> > > +		page = pc->page;
> > > +		/* Rotate */
> > > +		list_del(&pc->lru);
> > > +		list_add_tail(&pc->lru, &lru->list);
> > > +		/* get page for isolate_lru_page() */
> > > +		if (get_page_unless_zero(page)) {
> > > +			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > +			if (!isolate_lru_page(page)) {
> > > +				/* Now, This page is removed from LRU */
> > > +				if (trylock_page(page)) {
> > > +					drain += try_to_free_swap(page);
> > > +					unlock_page(page);
> > > +				}
> > > +				putback_lru_page(page);
> > > +			}
> > > +			put_page(page);
> > > +			spin_lock_irqsave(&zone->lru_lock, flags);
> > > +		}
> > > +	}
> > > +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > +
> > > +	return drain;
> > > +}
> > > +
> > > +/* access without lock...serialization is not so important here. */
> > > +static int last_visit;
> > > +void try_delete_orphan_caches(struct work_struct *work)
> > > +{
> > > +	int nid, zid, drain;
> > > +
> > > +	nid = last_visit;
> > > +	drain = 0;
> > > +	while (drain < SWAP_CLUSTER_MAX) {
> > > +		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> > > +		if (nid == MAX_NUMNODES)
> > > +			nid = 0;
> > > +		last_visit = nid;
> > > +		if (node_state(nid, N_HIGH_MEMORY))
> > > +			for (zid = 0; zid < MAX_NR_ZONES; zid++)
> > > +				drain += drain_orphan_swapcaches(nid, zid);
> > > +		if (nid == 0)
> > > +			break;
> > > +	}
> > > +}
> > > +DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
> > > +
> > > +static void check_orphan_stat(void)
> > > +{
> > > +	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
> > > +		schedule_work(&orphan_delete_work);
> > > +}
> > > +
> > > +static __init void init_orphan_lru(void)
> > > +{
> > > +	struct orphan_list_node *opl;
> > > +	int nid, zid;
> > > +
> > > +	for_each_node_state(nid, N_POSSIBLE) {
> > > +		opl = kmalloc(sizeof(struct orphan_list_node),  GFP_KERNEL);
> > > +		BUG_ON(!opl);
> > > +		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> > > +			INIT_LIST_HEAD(&opl->zone[zid].list);
> > > +			opl->zone[zid].event = 0;
> > > +		}
> > > +		orphan_list[nid] = opl;
> > > +	}
> > > +}
> > > +
> > >  /*
> > >   * Visit the first child (need not be the first child as per the ordering
> > >   * of the cgroup list, since we track last_scanned_child) of @mem and use
> > > @@ -2454,6 +2636,7 @@ mem_cgroup_create(struct cgroup_subsys *
> > >  	/* root ? */
> > >  	if (cont->parent == NULL) {
> > >  		enable_swap_cgroup();
> > > +		init_orphan_lru();
> > >  		parent = NULL;
> > >  	} else {
> > >  		parent = mem_cgroup_from_cont(cont->parent);
> > > 
> > > --
> > > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > > the body to majordomo@kvack.org.  For more info on Linux MM,
> > > see: http://www.linux-mm.org/ .
> > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > > 
> > 
> > 
> > 
> > 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19  2:16                     ` KAMEZAWA Hiroyuki
@ 2009-03-19  9:06                       ` KAMEZAWA Hiroyuki
  2009-03-19 10:01                         ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-19  9:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

The core logic is much improved and I confirmed it can reduce
orphan swap caches. (But the patch size is bigger than expected.)
A long-term test is required, and we have to verify that the parameters are
reasonable and that this doesn't make swapped-out applications slow.
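
(As a rough reading of the parameters below, not a measured number: on a 4-CPU
box with page_cluster=3 and nr_threads around 500, orphan_thresh() gives
8 * 500 = 4000 pages, which sits between base = 4 * 256 = 1024 and base * 4 =
4096, so the drain worker is scheduled once roughly 4000 orphan pages, about
16MB with 4KB pages, have accumulated.)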

-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Nishimura reported that unused swap caches are not reclaimed well under memcg.

Assume that a memory cgroup limits the memory usage of all applications
and file caches well, so the global LRU scan (kswapd etc.) never runs.

First, there is an *allowed* race that leaves SwapCache pages on the global LRU.
SwapCache pages can stay on the global LRU even when their swp_entry is no longer
referenced by anyone (no ptes). When a global LRU scan runs, they are reclaimed
by try_to_free_swap(). But they never appear on memcg's private LRU and are never
reclaimed by memcg's reclaim routines.

Second, there are readahead SwapCaches; some of them end up never being used and
are eventually reclaimed by the global LRU when a scan runs. But they are not on
memcg's private LRU and will not be reclaimed until a global LRU scan runs.

From memcg's point of view, the above two cases are not good. Especially, *unused*
swp_entries add pressure to memcg's mem+swap controller and can finally cause OOM.
(Nishimura confirmed this can cause OOM.)

This patch tries to reclaim unused swap caches by
  - adding a list for unused swap caches (orphan_list)
  - trying to reclaim the orphan list when it exceeds a threshold.

BTW, if we don't remove case 2 (unused swap caches), we can't determine a correct
threshold for reclaiming stale entries. So those pages should be dropped
to some extent. try_to_free_swap() cannot be used for case 2, so I added
try_to_drop_swapcache(). remove_mapping() checks all the critical conditions.

Changelog: v1 -> v2
 - use kmalloc_node() instead of kmalloc()
 - added try_to_drop_swapcache()
 - fixed silly bugs.
 - if only the root cgroup exists, none of this logic runs (all work is done by the global LRU).

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |   13 ++
 include/linux/swap.h        |    6 +
 mm/memcontrol.c             |  195 +++++++++++++++++++++++++++++++++++++++++++-
 mm/swapfile.c               |   23 +++++
 4 files changed, 236 insertions(+), 1 deletion(-)

Index: mmotm-2.6.29-Mar11/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.29-Mar11.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.29-Mar11/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
+
+#define TESTCLEARPCGFLAG(uname, lname) \
+static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTPCGFLAG(Orphan, ORPHAN)
+TESTSETPCGFLAG(Orphan, ORPHAN)
+TESTCLEARPCGFLAG(Orphan, ORPHAN)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
Index: mmotm-2.6.29-Mar11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar11/mm/memcontrol.c
@@ -371,6 +371,64 @@ static int mem_cgroup_walk_tree(struct m
  * When moving account, the page is not on LRU. It's isolated.
  */
 
+/*
+ * Orphan List is a list for page_cgroup which is not free but not under
+ * any cgroup. SwapCache which is prefetched by readahead() is typical type but
+ * there are other corner cases.
+ *
+ * Usually, updates to this list happens when swap cache is readaheaded and
+ * finally used by process.
+ */
+
+/* for orphan page_cgroups, updated under zone->lru_lock. */
+
+struct orphan_list_node {
+	struct orphan_list_zone {
+		int event;
+		struct list_head list;
+	} zone[MAX_NR_ZONES];
+};
+struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+#define ORPHAN_EVENT_THRESH (256)
+static void check_orphan_stat(void);
+static atomic_t nr_orphan_caches;
+static int memory_cgroup_is_used __read_mostly;
+
+static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
+{
+	/*
+	 * 2 cases for this BUG_ON(), swapcache is generated while init.
+	 * or NID should be invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
+static inline void remove_orphan_list(struct page_cgroup *pc)
+{
+	if (TestClearPageCgroupOrphan(pc)) {
+		list_del_init(&pc->lru);
+		atomic_dec(&nr_orphan_caches);
+	}
+}
+
+static void add_orphan_list(struct page *page, struct page_cgroup *pc)
+{
+	if (!TestSetPageCgroupOrphan(pc)) {
+		struct orphan_list_zone *opl;
+		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+		list_add_tail(&pc->lru, &opl->list);
+		atomic_inc(&nr_orphan_caches);
+		if (unlikely(opl->event++ > ORPHAN_EVENT_THRESH)) {
+			/* Orphan is not problem if no mem_cgroup is used */
+			if (memory_cgroup_is_used)
+				check_orphan_stat();
+			opl->event = 0;
+		}
+	}
+}
+
+
 void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
@@ -380,6 +438,14 @@ void mem_cgroup_del_lru_list(struct page
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on global LRU, it will be on
+	 * orphan list. remove here
+	 */
+	if (unlikely(PageCgroupOrphan(pc))) {
+		remove_orphan_list(pc);
+		return;
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -433,8 +499,11 @@ void mem_cgroup_add_lru_list(struct page
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
+	if (!PageCgroupUsed(pc)) {
+		/* handle swap cache here */
+		add_orphan_list(page, pc);
 		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -471,6 +540,9 @@ static void mem_cgroup_lru_add_after_com
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 
 	spin_lock_irqsave(&zone->lru_lock, flags);
+	if (PageCgroupOrphan(pc))
+		remove_orphan_list(pc);
+
 	/* link when the page is linked to LRU but page_cgroup isn't */
 	if (PageLRU(page) && list_empty(&pc->lru))
 		mem_cgroup_add_lru_list(page, page_lru(page));
@@ -785,6 +857,125 @@ static int mem_cgroup_count_children(str
 }
 
 /*
+ * Using big number here for avoiding to free orphan swap-cache by readahead
+ * We don't want to delete swap caches read by readahead.
+ */
+static int orphan_thresh(void)
+{
+	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
+	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
+
+	nr_pages *= nr_threads; /* nr_threads can be too big, too small */
+
+	/* too small value will kill readahead */
+	if (nr_pages < base)
+		return base;
+
+	/* too big is not suitable here */
+	if (nr_pages > base * 4)
+		return base * 4;
+
+	return nr_pages;
+}
+
+/*
+ * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
+ * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
+ */
+static int drain_orphan_swapcaches(int nid, int zid)
+{
+	struct page_cgroup *pc;
+	struct zone *zone;
+	struct page *page;
+	struct orphan_list_zone *lru = orphan_lru(nid, zid);
+	unsigned long flags;
+	int drain, scan;
+
+	zone = &NODE_DATA(nid)->node_zones[zid];
+	scan = ORPHAN_EVENT_THRESH/2;
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	while (!list_empty(&lru->list) && (scan > 0)) {
+		scan--;
+		pc = list_entry(lru->list.next, struct page_cgroup, lru);
+		page = pc->page;
+		/* Rotate */
+		list_del(&pc->lru);
+		list_add_tail(&pc->lru, &lru->list);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		/* Remove from LRU */
+		if (!isolate_lru_page(page)) { /* get_page is called */
+			if (!page_mapped(page) && trylock_page(page)) {
+				/* This does all necessary jobs */
+				drain += try_to_drop_swapcache(page);
+				unlock_page(page);
+			}
+			putback_lru_page(page); /* put_page is called */
+		}
+		spin_lock_irqsave(&zone->lru_lock, flags);
+	}
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+	return drain;
+}
+
+/*
+ * last_visit is a marker to remember which node should be scanned next.
+ * Only one worker can enter this routine at the same time.
+ */
+static int last_visit;
+void try_delete_orphan_caches(struct work_struct *work)
+{
+	int nid, zid, drain;
+	static atomic_t orphan_scan_worker;
+
+	if (atomic_inc_return(&orphan_scan_worker) > 1) {
+		atomic_dec(&orphan_scan_worker);
+		return;
+	}
+	nid = last_visit;
+	drain = 0;
+	while (!drain) {
+		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
+		if (nid == MAX_NUMNODES)
+			nid = 0;
+		last_visit = nid;
+		if (node_state(nid, N_HIGH_MEMORY))
+			for (zid = 0; zid < MAX_NR_ZONES; zid++)
+				drain += drain_orphan_swapcaches(nid, zid);
+		if (nid == 0)
+			break;
+	}
+	atomic_dec(&orphan_scan_worker);
+}
+DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
+
+static void check_orphan_stat(void)
+{
+	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
+		schedule_work(&orphan_delete_work);
+}
+
+static __init void init_orphan_lru(void)
+{
+	struct orphan_list_node *opl;
+	int nid, zid;
+	int size = sizeof(struct orphan_list_node);
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		if (node_state(nid, N_NORMAL_MEMORY))
+			opl = kmalloc_node(size,  GFP_KERNEL, nid);
+		else
+			opl = kmalloc(size, GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+			INIT_LIST_HEAD(&opl->zone[zid].list);
+			opl->zone[zid].event = 0;
+		}
+		orphan_list[nid] = opl;
+	}
+}
+
+/*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
  * that to reclaim free pages from.
@@ -2454,10 +2645,12 @@ mem_cgroup_create(struct cgroup_subsys *
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
+		memory_cgroup_is_used = 1;
 	}
 
 	if (parent && parent->use_hierarchy) {
Index: mmotm-2.6.29-Mar11/mm/swapfile.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/swapfile.c
+++ mmotm-2.6.29-Mar11/mm/swapfile.c
@@ -571,6 +571,29 @@ int try_to_free_swap(struct page *page)
 }
 
 /*
+ * Similar to try_to_free_swap() but this drops SwapCache without checking
+ * page_swapcount(). With this, the function removes not only an unused swap
+ * entry but also a swap cache which is in memory but never used.
+ * The caller should have a reference to this page and it must be locked.
+ */
+int try_to_drop_swapcache(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+
+	if (!PageSwapCache(page))
+		return 0;
+	if (PageWriteback(page))
+		return 0;
+	if (page_mapped(page))
+		return 0;
+	/*
+	 * remove_mapping() will succeed only when there is no extra
+	 * user of the swap cache. (This keeps us safe against speculative lookup.)
+	 */
+	return remove_mapping(&swapper_space, page);
+}
+
+/*
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
  */
Index: mmotm-2.6.29-Mar11/include/linux/swap.h
===================================================================
--- mmotm-2.6.29-Mar11.orig/include/linux/swap.h
+++ mmotm-2.6.29-Mar11/include/linux/swap.h
@@ -312,6 +312,7 @@ extern sector_t swapdev_block(int, pgoff
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
+extern int try_to_drop_swapcache(struct page *);
 struct backing_dev_info;
 
 /* linux/mm/thrash.c */
@@ -414,6 +415,11 @@ static inline int try_to_free_swap(struc
 	return 0;
 }
 
+static inline int try_to_drop_swapcache(struct page *page)
+{
+	return 0;
+}
+
 static inline swp_entry_t get_swap_page(void)
 {
 	swp_entry_t entry;


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19  9:06                       ` [PATCH] fix unused/stale swap cache handling on memcg v2 KAMEZAWA Hiroyuki
@ 2009-03-19 10:01                         ` Daisuke Nishimura
  2009-03-19 10:13                           ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-19 10:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Thu, 19 Mar 2009 18:06:31 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> The core logic is much improved and I confirmed this logic can reduce
> orphan swap-caches. (But the patch size is bigger than expected.)
> Long-term testing is required, and we have to verify the parameters are
> reasonable and whether this doesn't make swapped-out applications slow..
> 
Thank you for your patch.
I'll test this version and check what happens about swapcache usage.

Thanks,
Daisuke Nishimura.

> -Kame
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Nishimura reported unused swap cache is not reclaimed well under memcg.
> 
> Assume that the memory cgroup limits the memory usage of all applications
> and file caches well, and the global LRU scan (kswapd() etc..) never runs.
> 
> First, there is an *allowed* race for SwapCache on the global LRU. There can
> be SwapCaches on the global LRU even when the swp_entry is not referred to by
> anyone (ptes). When a global LRU scan runs, they will be reclaimed by
> try_to_free_swap(). But they will not appear on memcg's private LRU and are
> never reclaimed by memcg's reclaim routines.
> 
> Second, there are readahead SwapCaches; some of them tend not to be used and
> are eventually reclaimed by the global LRU when a scan runs. But they are not
> on memcg's private LRU and will not be reclaimed until a global LRU scan runs.
> 
> From memcg's point of view, the above 2 cases are not very good. In particular,
> *unused* swp_entries add pressure to memcg's mem+swap controller and finally
> cause OOM. (Nishimura confirmed this can cause OOM.)
> 
> This patch tries to reclaim unused swapcache by
>   - adding a list for unused swapcache (orphan_list)
>   - trying to reclaim the orphan list at some threshold.
> 
> BTW, if we don't remove "2" (unused swapcache), we can't detect a correct
> threshold for reclaiming stale entries. So the pages should be dropped
> to some extent. try_to_free_swap() cannot be used for "2", so I added
> try_to_drop_swapcache(). remove_mapping() checks all the critical things.
> 
> Changelog: v1 -> v2
>  - use kmalloc_node() instead of kmalloc()
>  - added try_to_drop_swapcache()
>  - fixed silly bugs.
>  - If only the root cgroup exists, no logic will work. (all jobs are done by global LRU)
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/page_cgroup.h |   13 ++
>  include/linux/swap.h        |    6 +
>  mm/memcontrol.c             |  195 +++++++++++++++++++++++++++++++++++++++++++-
>  mm/swapfile.c               |   23 +++++
>  4 files changed, 236 insertions(+), 1 deletion(-)
> 
> Index: mmotm-2.6.29-Mar11/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.29-Mar11.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.29-Mar11/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(
>  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname) \
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> +
> +#define TESTCLEARPCGFLAG(uname, lname) \
> +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTPCGFLAG(Orphan, ORPHAN)
> +TESTSETPCGFLAG(Orphan, ORPHAN)
> +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> Index: mmotm-2.6.29-Mar11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.29-Mar11.orig/mm/memcontrol.c
> +++ mmotm-2.6.29-Mar11/mm/memcontrol.c
> @@ -371,6 +371,64 @@ static int mem_cgroup_walk_tree(struct m
>   * When moving account, the page is not on LRU. It's isolated.
>   */
>  
> +/*
> + * Orphan List is a list for page_cgroup which is not free but not under
> + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> + * there are other corner cases.
> + *
> + * Usually, updates to this list happens when swap cache is readaheaded and
> + * finally used by process.
> + */
> +
> +/* for orphan page_cgroups, updated under zone->lru_lock. */
> +
> +struct orphan_list_node {
> +	struct orphan_list_zone {
> +		int event;
> +		struct list_head list;
> +	} zone[MAX_NR_ZONES];
> +};
> +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +#define ORPHAN_EVENT_THRESH (256)
> +static void check_orphan_stat(void);
> +static atomic_t nr_orphan_caches;
> +static int memory_cgroup_is_used __read_mostly;
> +
> +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
> +static inline void remove_orphan_list(struct page_cgroup *pc)
> +{
> +	if (TestClearPageCgroupOrphan(pc)) {
> +		list_del_init(&pc->lru);
> +		atomic_dec(&nr_orphan_caches);
> +	}
> +}
> +
> +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> +{
> +	if (!TestSetPageCgroupOrphan(pc)) {
> +		struct orphan_list_zone *opl;
> +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> +		list_add_tail(&pc->lru, &opl->list);
> +		atomic_inc(&nr_orphan_caches);
> +		if (unlikely(opl->event++ > ORPHAN_EVENT_THRESH)) {
> +			/* Orphan is not problem if no mem_cgroup is used */
> +			if (memory_cgroup_is_used)
> +				check_orphan_stat();
> +			opl->event = 0;
> +		}
> +	}
> +}
> +
> +
>  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  {
>  	struct page_cgroup *pc;
> @@ -380,6 +438,14 @@ void mem_cgroup_del_lru_list(struct page
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageCgroupOrphan(pc))) {
> +		remove_orphan_list(pc);
> +		return;
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
> @@ -433,8 +499,11 @@ void mem_cgroup_add_lru_list(struct page
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> +	if (!PageCgroupUsed(pc)) {
> +		/* handle swap cache here */
> +		add_orphan_list(page, pc);
>  		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
> @@ -471,6 +540,9 @@ static void mem_cgroup_lru_add_after_com
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>  
>  	spin_lock_irqsave(&zone->lru_lock, flags);
> +	if (PageCgroupOrphan(pc))
> +		remove_orphan_list(pc);
> +
>  	/* link when the page is linked to LRU but page_cgroup isn't */
>  	if (PageLRU(page) && list_empty(&pc->lru))
>  		mem_cgroup_add_lru_list(page, page_lru(page));
> @@ -785,6 +857,125 @@ static int mem_cgroup_count_children(str
>  }
>  
>  /*
> + * Using big number here for avoiding to free orphan swap-cache by readahead
> + * We don't want to delete swap caches read by readahead.
> + */
> +static int orphan_thresh(void)
> +{
> +	int nr_pages = (1 << page_cluster); /* max size of a swap readahead */
> +	int base = num_online_cpus() * 256; /* 1M per cpu if swap is 4k */
> +
> +	nr_pages *= nr_threads; /* nr_threads can be too big, too small */
> +
> +	/* too small value will kill readahead */
> +	if (nr_pages < base)
> +		return base;
> +
> +	/* too big is not suitable here */
> +	if (nr_pages > base * 4)
> +		return base * 4;
> +
> +	return nr_pages;
> +}
> +
> +/*
> + * In usual, *unused* swap cache are reclaimed by global LRU. But, if no one
> + * kicks global LRU, they will not be reclaimed. When using memcg, it's trouble.
> + */
> +static int drain_orphan_swapcaches(int nid, int zid)
> +{
> +	struct page_cgroup *pc;
> +	struct zone *zone;
> +	struct page *page;
> +	struct orphan_list_zone *lru = orphan_lru(nid, zid);
> +	unsigned long flags;
> +	int drain, scan;
> +
> +	zone = &NODE_DATA(nid)->node_zones[zid];
> +	scan = ORPHAN_EVENT_THRESH/2;
> +	spin_lock_irqsave(&zone->lru_lock, flags);
> +	while (!list_empty(&lru->list) && (scan > 0)) {
> +		scan--;
> +		pc = list_entry(lru->list.next, struct page_cgroup, lru);
> +		page = pc->page;
> +		/* Rotate */
> +		list_del(&pc->lru);
> +		list_add_tail(&pc->lru, &lru->list);
> +		spin_unlock_irqrestore(&zone->lru_lock, flags);
> +		/* Remove from LRU */
> +		if (!isolate_lru_page(page)) { /* get_page is called */
> +			if (!page_mapped(page) && trylock_page(page)) {
> +				/* This does all necessary jobs */
> +				drain += try_to_drop_swapcache(page);
> +				unlock_page(page);
> +			}
> +			putback_lru_page(page); /* put_page is called */
> +		}
> +		spin_lock_irqsave(&zone->lru_lock, flags);
> +	}
> +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
> +	return drain;
> +}
> +
> +/*
> + * last_visit is marker to remember which node should be scanned next.
> + * Only one worker can enter this routine at the same time.
> + */
> +static int last_visit;
> +void try_delete_orphan_caches(struct work_struct *work)
> +{
> +	int nid, zid, drain;
> +	static atomic_t orphan_scan_worker;
> +
> +	if (atomic_inc_return(&orphan_scan_worker) > 1) {
> +		atomic_dec(&orphan_scan_worker);
> +		return;
> +	}
> +	nid = last_visit;
> +	drain = 0;
> +	while (!drain) {
> +		nid = next_node(nid, node_states[N_HIGH_MEMORY]);
> +		if (nid == MAX_NUMNODES)
> +			nid = 0;
> +		last_visit = nid;
> +		if (node_state(nid, N_HIGH_MEMORY))
> +			for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +				drain += drain_orphan_swapcaches(nid, zid);
> +		if (nid == 0)
> +			break;
> +	}
> +	atomic_dec(&orphan_scan_worker);
> +}
> +DECLARE_WORK(orphan_delete_work, try_delete_orphan_caches);
> +
> +static void check_orphan_stat(void)
> +{
> +	if (atomic_read(&nr_orphan_caches) > orphan_thresh())
> +		schedule_work(&orphan_delete_work);
> +}
> +
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_list_node *opl;
> +	int nid, zid;
> +	int size = sizeof(struct orphan_list_node);
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		if (node_state(nid, N_NORMAL_MEMORY))
> +			opl = kmalloc_node(size,  GFP_KERNEL, nid);
> +		else
> +			opl = kmalloc(size, GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +			INIT_LIST_HEAD(&opl->zone[zid].list);
> +			opl->zone[zid].event = 0;
> +		}
> +		orphan_list[nid] = opl;
> +	}
> +}
> +
> +/*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
>   * that to reclaim free pages from.
> @@ -2454,10 +2645,12 @@ mem_cgroup_create(struct cgroup_subsys *
>  	/* root ? */
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
> +		init_orphan_lru();
>  		parent = NULL;
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
>  		mem->use_hierarchy = parent->use_hierarchy;
> +		memory_cgroup_is_used = 1;
>  	}
>  
>  	if (parent && parent->use_hierarchy) {
> Index: mmotm-2.6.29-Mar11/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.29-Mar11.orig/mm/swapfile.c
> +++ mmotm-2.6.29-Mar11/mm/swapfile.c
> @@ -571,6 +571,29 @@ int try_to_free_swap(struct page *page)
>  }
>  
>  /*
> + * Similar to try_to_free_swap() but this drops SwapCache without checking
> + * page_swapcount(). By this, this function removes not only unused swap entry
> + * but alos a swap-cache which is on memory but never used.
> + * The caller should have a reference to this page and it must be locked.
> + */
> +int try_to_drop_swapcache(struct page *page)
> +{
> +	VM_BUG_ON(!PageLocked(page));
> +
> +	if (!PageSwapCache(page))
> +		return 0;
> +	if (PageWriteback(page))
> +		return 0;
> +	if (page_mapped(page))
> +		return 0;
> +	/*
> +	 * remove_mapping() will success only when there is no extra
> + 	 * user of swap cache. (Keeping sanity be speculative lookup)
> + 	 */
> +	return remove_mapping(&swapper_space, page);
> +}
> +
> +/*
>   * Free the swap entry like above, but also try to
>   * free the page cache entry if it is the last user.
>   */
> Index: mmotm-2.6.29-Mar11/include/linux/swap.h
> ===================================================================
> --- mmotm-2.6.29-Mar11.orig/include/linux/swap.h
> +++ mmotm-2.6.29-Mar11/include/linux/swap.h
> @@ -312,6 +312,7 @@ extern sector_t swapdev_block(int, pgoff
>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
>  extern int reuse_swap_page(struct page *);
>  extern int try_to_free_swap(struct page *);
> +extern int try_to_drop_swapcache(struct page *);
>  struct backing_dev_info;
>  
>  /* linux/mm/thrash.c */
> @@ -414,6 +415,11 @@ static inline int try_to_free_swap(struc
>  	return 0;
>  }
>  
> +static inline int try_to_drop_swapcache(struct page *page)
> +{
> +	return 0;
> +}
> +
>  static inline swp_entry_t get_swap_page(void)
>  {
>  	swp_entry_t entry;
> 


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19 10:01                         ` Daisuke Nishimura
@ 2009-03-19 10:13                           ` Daisuke Nishimura
  2009-03-19 10:46                             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-19 10:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Thu, 19 Mar 2009 19:01:18 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Thu, 19 Mar 2009 18:06:31 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Core logic are much improved and I confirmed this logic can reduce
> > orphan swap-caches. (But the patch size is bigger than expected.)
> > Long term test is required and we have to verify paramaters are reasonable
> > and whether this doesn't make swapped-out applications slow..
> > 
> Thank you for your patch.
> I'll test this version and check what happens about swapcache usage.
> 
hmm... underflow of inactive_anon seems to happen after a while.
I've not done anything yet other than causing memory pressure.

[nishimura@GibsonE ~]$ cat /cgroup/memory/01/memory.stat
cache 22994944
rss 10559488
pgpgin 2301009
pgpgout 2292817
active_anon 21004288
inactive_anon 18446744073709510656
active_file 1605632
inactive_file 10944512
unevictable 0
hierarchical_memory_limit 33554432
hierarchical_memsw_limit 50331648
inactive_ratio 1
recent_rotated_anon 857
recent_rotated_file 10
recent_scanned_anon 877
recent_scanned_file 400
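
(For reference, a quick sanity check on that inactive_anon value, assuming 4kB
pages: 2^64 - 18446744073709510656 = 40960 bytes = 10 pages, so the counter has
gone 10 pages below zero rather than the usage actually being that large.)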


Thanks,
Daisuke Nishimura.


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19 10:13                           ` Daisuke Nishimura
@ 2009-03-19 10:46                             ` KAMEZAWA Hiroyuki
  2009-03-19 11:36                               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-19 10:46 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm, Balbir Singh,
	Hugh Dickins

Daisuke Nishimura wrote:
> On Thu, 19 Mar 2009 19:01:18 +0900, Daisuke Nishimura
> <nishimura@mxp.nes.nec.co.jp> wrote:
>> On Thu, 19 Mar 2009 18:06:31 +0900, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > Core logic are much improved and I confirmed this logic can reduce
>> > orphan swap-caches. (But the patch size is bigger than expected.)
>> > Long term test is required and we have to verify paramaters are
>> reasonable
>> > and whether this doesn't make swapped-out applications slow..
>> >
>> Thank you for your patch.
>> I'll test this version and check what happens about swapcache usage.
>>
> hmm... underflow of inactive_anon seems to happen after a while.
> I've not done anything but causing memory pressure yet.
>
Hmm.. maybe I'm missing something. Maybe mem_cgroup_commit_charge() removes
the Orphan flag implicitly.

I'll dig, but I may not be able to post a patch this week.

Thanks,
-Kame



> [nishimura@GibsonE ~]$ cat /cgroup/memory/01/memory.stat
> cache 22994944
> rss 10559488
> pgpgin 2301009
> pgpgout 2292817
> active_anon 21004288
> inactive_anon 18446744073709510656
> active_file 1605632
> inactive_file 10944512
> unevictable 0
> hierarchical_memory_limit 33554432
> hierarchical_memsw_limit 50331648
> inactive_ratio 1
> recent_rotated_anon 857
> recent_rotated_file 10
> recent_scanned_anon 877
> recent_scanned_file 400
>
>
> Thanks,
> Daisuke Nishimura.
>
>



* Re: [PATCH] fix unused/stale swap cache handling on memcg  v2
  2009-03-19 10:46                             ` KAMEZAWA Hiroyuki
@ 2009-03-19 11:36                               ` KAMEZAWA Hiroyuki
  2009-03-20  7:45                                 ` [PATCH] fix unused/stale swap cache handling on memcg v3 KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-19 11:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Daisuke Nishimura, linux-mm, Balbir Singh,
	Hugh Dickins

KAMEZAWA Hiroyuki wrote:
> Daisuke Nishimura wrote:
>> On Thu, 19 Mar 2009 19:01:18 +0900, Daisuke Nishimura
>> <nishimura@mxp.nes.nec.co.jp> wrote:
>>> On Thu, 19 Mar 2009 18:06:31 +0900, KAMEZAWA Hiroyuki
>>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> > Core logic are much improved and I confirmed this logic can reduce
>>> > orphan swap-caches. (But the patch size is bigger than expected.)
>>> > Long term test is required and we have to verify paramaters are
>>> reasonable
>>> > and whether this doesn't make swapped-out applications slow..
>>> >
>>> Thank you for your patch.
>>> I'll test this version and check what happens about swapcache usage.
>>>
>> hmm... underflow of inactive_anon seems to happen after a while.
>> I've not done anything but causing memory pressure yet.
>>
> Hmm..maybe I miss something. maybe mem_cgroup_commit_charge() removes
> Orphan flag implicitly.
>
I couldn't reproduce it, hmm.. but yes, there would seem to be something racy.

> I'll dig but may not be able to post a patch in this week.
>
The more I consider it, the more complicated the code becomes.
Sigh... I'd like to find another way, if I can.

Thanks,
-Kame


* [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-19 11:36                               ` KAMEZAWA Hiroyuki
@ 2009-03-20  7:45                                 ` KAMEZAWA Hiroyuki
  2009-03-23  1:45                                   ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-20  7:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Daisuke Nishimura, linux-mm, Balbir Singh,
	Hugh Dickins

I'll test this one this weekend.
Maybe it's much simpler than the previous ones. Thank you for all your help!

-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Nishimura reported that, in a racy case, swap cache is not freed even if it
will never be used. To make use of the laziness of the LRU, some racy pages
are _intentionally_ not freed, and the kernel expects the global LRU to
reclaim them later.

When it comes to memcg, if well controlled, the global LRU will not run very
often, and the above "ok, it's busy, reclaim it later by global LRU" logic
means a leak of swp_entries. Nishimura found that this can cause OOM.

This patch tries to fix this by calling try_to_free_swap() against the
stale swap caches.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    2 ++
 mm/memcontrol.c      |   41 +++++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c        |   23 ++++++++++++++++++-----
 mm/vmscan.c          |    9 +++++++++
 4 files changed, 70 insertions(+), 5 deletions(-)

Index: mmotm-2.6.29-Mar11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar11/mm/memcontrol.c
@@ -1550,8 +1550,49 @@ void mem_cgroup_uncharge_swap(swp_entry_
 	}
 	rcu_read_unlock();
 }
+
 #endif
 
+/* For handling some racy cases. */
+struct memcg_swap_validate {
+	struct work_struct work;
+	struct page *page;
+};
+
+static void mem_cgroup_validate_swapcache_cb(struct work_struct *work)
+{
+	struct memcg_swap_validate *mywork;
+	struct page *page;
+
+	mywork = container_of(work, struct memcg_swap_validate, work);
+	page = mywork->page;
+	/* We can wait for the lock now... validate whether the swap is still alive or not */
+	lock_page(page);
+	try_to_free_swap(page);
+	unlock_page(page);
+	put_page(page);
+	kfree(mywork);
+	return;
+}
+
+void mem_cgroup_validate_swapcache(struct page *page)
+{
+	struct memcg_swap_validate *work;
+	/*
+	 * Unfortunately, we cannot lock this page here. So, schedule this
+	 * again later.
+	 */
+	get_page(page);
+	work = kmalloc(sizeof(*work), GFP_ATOMIC);
+	if (work) {
+		INIT_WORK(&work->work, mem_cgroup_validate_swapcache_cb);
+		work->page = page;
+		schedule_work(&work->work);
+	} else /* If this small kmalloc() fails, LRU will work and find this */
+		put_page(page);
+	return;
+}
+
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
Index: mmotm-2.6.29-Mar11/mm/swapfile.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/swapfile.c
+++ mmotm-2.6.29-Mar11/mm/swapfile.c
@@ -578,6 +578,7 @@ int free_swap_and_cache(swp_entry_t entr
 {
 	struct swap_info_struct *p;
 	struct page *page = NULL;
+	struct page *check = NULL;
 
 	if (is_migration_entry(entry))
 		return 1;
@@ -586,9 +587,11 @@ int free_swap_and_cache(swp_entry_t entr
 	if (p) {
 		if (swap_entry_free(p, entry) == 1) {
 			page = find_get_page(&swapper_space, entry.val);
-			if (page && !trylock_page(page)) {
-				page_cache_release(page);
-				page = NULL;
+			if (page) {
+				if (!trylock_page(page)) {
+					check = page;
+					page = NULL;
+				}
 			}
 		}
 		spin_unlock(&swap_lock);
@@ -602,10 +605,20 @@ int free_swap_and_cache(swp_entry_t entr
 				(!page_mapped(page) || vm_swap_full())) {
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
-		}
+		} else
+			check = page;
 		unlock_page(page);
-		page_cache_release(page);
+		if (!check)
+			page_cache_release(page);
 	}
+
+	if (check) {
+		/* Check accounting of this page in lazy way.*/
+		if (PageSwapCache(check) && !page_mapped(check))
+			mem_cgroup_validate_swapcache(check);
+		page_cache_release(check);
+	}
+
 	return p != NULL;
 }
 
Index: mmotm-2.6.29-Mar11/mm/vmscan.c
===================================================================
--- mmotm-2.6.29-Mar11.orig/mm/vmscan.c
+++ mmotm-2.6.29-Mar11/mm/vmscan.c
@@ -782,6 +782,15 @@ activate_locked:
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
+		/*
+		 * This can happen in a racy case between unmap and us. If
+		 * a page is added to swapcache while it's being unmapped, the
+		 * page may reach here. Check again whether this page (swap) is
+		 * worth keeping.
+		 * (Does this need to be done only under memcg?)
+		 */
+		if (PageSwapCache(page) && !page_mapped(page))
+			try_to_free_swap(page);
 		unlock_page(page);
 keep:
 		list_add(&page->lru, &ret_pages);
Index: mmotm-2.6.29-Mar11/include/linux/swap.h
===================================================================
--- mmotm-2.6.29-Mar11.orig/include/linux/swap.h
+++ mmotm-2.6.29-Mar11/include/linux/swap.h
@@ -337,11 +337,13 @@ static inline void disable_swap_token(vo
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern void mem_cgroup_validate_swapcache(struct page *page);
 #else
 static inline void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
 {
 }
+static inline void mem_cgroup_validate_swapcache(struct page *page) {}
 #endif
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern void mem_cgroup_uncharge_swap(swp_entry_t ent);


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-20  7:45                                 ` [PATCH] fix unused/stale swap cache handling on memcg v3 KAMEZAWA Hiroyuki
@ 2009-03-23  1:45                                   ` Daisuke Nishimura
  2009-03-23  2:41                                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-23  1:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

It might be a nitpick, but I think this patch cannot handle the case:

            processA                   |           processB
  -------------------------------------+-------------------------------------
    (free_swap_and_cache())            |  (read_swap_cache_async())
                                       |    swap_duplicate()
      swap_entry_free() == 1           |
      find_get_page() -> cannot find   |
                                       |    __set_page_locked()
                                       |    add_to_swap_cache()
                                       |    lru_cache_add_anon()
                                       |      doesn't link this page to memcg's
                                       |      LRU, because of !PageCgroupUsed.

And I think we should avoid changing non-memcg code as much as possible,
so I prefer the orphan list approach.

The cause of the odd behavior of the previous orphan list patch is
the race between commit_charge_swapin and lru_add.
The PCG_ORPHAN flag is set by lru_add, so it cannot be seen while the
page is on a pvec (on another cpu). So lru_del_before_commit_swapcache
cannot remove the pc from the orphan list.

How about this patch?
It (rebased on mmotm) worked well (applied to -rc8 + some patches) during last
night on my note PC (i386, 2 CPUs, 2GB).

- change cache_charge to use try_charge_swapin() and commit_charge_swapin()
  when PageSwapCache.
- keep zone->lru_lock while commit_charge if needed to protect PCG_ORPHAN.

It only introduces orphan list and doesn't implement the reclaim part.

I can post a patch only for changing cache_charge if you want.


Thanks,
Daisuke Nishimura.
===
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/page_cgroup.h |   13 +++
 mm/memcontrol.c             |  203 ++++++++++++++++++++++++++-----------------
 2 files changed, 135 insertions(+), 81 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..47ad25c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
+
+#define TESTCLEARPCGFLAG(uname, lname) \
+static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
+	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTPCGFLAG(Orphan, ORPHAN)
+TESTSETPCGFLAG(Orphan, ORPHAN)
+TESTCLEARPCGFLAG(Orphan, ORPHAN)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 55dea59..39cf11f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -371,6 +371,50 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
  * When moving account, the page is not on LRU. It's isolated.
  */
 
+/*
+ * Orphan List is a list for page_cgroups which are not free but not under
+ * any cgroup. SwapCache which is prefetched by readahead() is the typical
+ * case, but there are other corner cases.
+ *
+ * Usually, updates to this list happen when swap cache is read ahead and
+ * finally used by a process.
+ */
+
+/* for orphan page_cgroups, updated under zone->lru_lock. */
+
+struct orphan_list_node {
+	struct orphan_list_zone {
+		struct list_head list;
+	} zone[MAX_NR_ZONES];
+};
+struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+
+static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
+{
+	/*
+	 * 2 cases for this BUG_ON(): a swapcache was generated during init,
+	 * or the NID is invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
+static inline void remove_orphan_list(struct page_cgroup *pc)
+{
+	if (TestClearPageCgroupOrphan(pc))
+		list_del_init(&pc->lru);
+}
+
+static void add_orphan_list(struct page *page, struct page_cgroup *pc)
+{
+	if (!TestSetPageCgroupOrphan(pc)) {
+		struct orphan_list_zone *opl;
+		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+		list_add_tail(&pc->lru, &opl->list);
+	}
+}
+
+
 void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
@@ -380,6 +424,14 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on the global LRU, it will be
+	 * on the orphan list. Remove it here.
+	 */
+	if (unlikely(PageCgroupOrphan(pc))) {
+		remove_orphan_list(pc);
+		return;
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -433,51 +485,17 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
-		return;
+	if (!PageCgroupUsed(pc)) {
+		/* handle swap cache here */
+		add_orphan_list(page, pc);
+		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
 	list_add(&pc->lru, &mz->lists[lru]);
 }
 
-/*
- * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
- * lru because the page may.be reused after it's fully uncharged (because of
- * SwapCache behavior).To handle that, unlink page_cgroup from LRU when charge
- * it again. This function is only used to charge SwapCache. It's done under
- * lock_page and expected that zone->lru_lock is never held.
- */
-static void mem_cgroup_lru_del_before_commit_swapcache(struct page *page)
-{
-	unsigned long flags;
-	struct zone *zone = page_zone(page);
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	/*
-	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
-	 * is guarded by lock_page() because the page is SwapCache.
-	 */
-	if (!PageCgroupUsed(pc))
-		mem_cgroup_del_lru_list(page, page_lru(page));
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-}
-
-static void mem_cgroup_lru_add_after_commit_swapcache(struct page *page)
-{
-	unsigned long flags;
-	struct zone *zone = page_zone(page);
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	/* link when the page is linked to LRU but page_cgroup isn't */
-	if (PageLRU(page) && list_empty(&pc->lru))
-		mem_cgroup_add_lru_list(page, page_lru(page));
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-}
-
-
 void mem_cgroup_move_lists(struct page *page,
 			   enum lru_list from, enum lru_list to)
 {
@@ -784,6 +802,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 	return num;
 }
 
+static __init void init_orphan_lru(void)
+{
+	struct orphan_list_node *opl;
+	int nid, zid;
+	int size = sizeof(struct orphan_list_node);
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		if (node_state(nid, N_NORMAL_MEMORY))
+			opl = kmalloc_node(size,  GFP_KERNEL, nid);
+		else
+			opl = kmalloc(size, GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			INIT_LIST_HEAD(&opl->zone[zid].list);
+		orphan_list[nid] = opl;
+	}
+}
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -1238,6 +1274,10 @@ int mem_cgroup_newpage_charge(struct page *page,
 				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
 }
 
+static void
+__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
+					enum charge_type ctype);
+
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
@@ -1274,16 +1314,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		unlock_page_cgroup(pc);
 	}
 
-	if (do_swap_account && PageSwapCache(page)) {
-		mem = try_get_mem_cgroup_from_swapcache(page);
-		if (mem)
-			mm = NULL;
-		  else
-			mem = NULL;
-		/* SwapCache may be still linked to LRU now. */
-		mem_cgroup_lru_del_before_commit_swapcache(page);
-	}
-
 	if (unlikely(!mm && !mem))
 		mm = &init_mm;
 
@@ -1291,32 +1321,16 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		return mem_cgroup_charge_common(page, mm, gfp_mask,
 				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
 
-	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
-	if (mem)
-		css_put(&mem->css);
-	if (PageSwapCache(page))
-		mem_cgroup_lru_add_after_commit_swapcache(page);
+	/* shmem */
+	if (PageSwapCache(page)) {
+		ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
+		if (!ret)
+			__mem_cgroup_commit_charge_swapin(page, mem,
+					MEM_CGROUP_CHARGE_TYPE_SHMEM);
+	} else
+		ret = mem_cgroup_charge_common(page, mm, gfp_mask,
+					MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
 
-	if (do_swap_account && !ret && PageSwapCache(page)) {
-		swp_entry_t ent = {.val = page_private(page)};
-		unsigned short id;
-		/* avoid double counting */
-		id = swap_cgroup_record(ent, 0);
-		rcu_read_lock();
-		mem = mem_cgroup_lookup(id);
-		if (mem) {
-			/*
-			 * We did swap-in. Then, this entry is doubly counted
-			 * both in mem and memsw. We uncharge it, here.
-			 * Recorded ID can be obsolete. We avoid calling
-			 * css_tryget()
-			 */
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-			mem_cgroup_put(mem);
-		}
-		rcu_read_unlock();
-	}
 	return ret;
 }
 
@@ -1359,18 +1373,40 @@ charge_cur_mm:
 	return __mem_cgroup_try_charge(mm, mask, ptr, true);
 }
 
-void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
+static void
+__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
+					enum charge_type ctype)
 {
-	struct page_cgroup *pc;
+	unsigned long flags;
+	struct zone *zone = page_zone(page);
+	struct page_cgroup *pc = lookup_page_cgroup(page);
+	int locked = 0;
 
 	if (mem_cgroup_disabled())
 		return;
 	if (!ptr)
 		return;
-	pc = lookup_page_cgroup(page);
-	mem_cgroup_lru_del_before_commit_swapcache(page);
-	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
-	mem_cgroup_lru_add_after_commit_swapcache(page);
+
+	/*
+	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
+	 * is guarded by lock_page() because the page is SwapCache.
+	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
+	 */
+	if (!PageCgroupUsed(pc)) {
+		locked = 1;
+		spin_lock_irqsave(&zone->lru_lock, flags);
+		mem_cgroup_del_lru_list(page, page_lru(page));
+	}
+
+	__mem_cgroup_commit_charge(ptr, pc, ctype);
+
+	if (locked) {
+		/* link when the page is linked to LRU but page_cgroup isn't */
+		if (PageLRU(page) && list_empty(&pc->lru))
+			mem_cgroup_add_lru_list(page, page_lru(page));
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	}
+
 	/*
 	 * Now swap is on-memory. This means this page may be
 	 * counted both as mem and swap....double count.
@@ -1396,8 +1432,12 @@ void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
+}
 
+void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
+{
+	__mem_cgroup_commit_charge_swapin(page, ptr,
+					MEM_CGROUP_CHARGE_TYPE_MAPPED);
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -2452,6 +2492,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);


* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-23  1:45                                   ` Daisuke Nishimura
@ 2009-03-23  2:41                                     ` KAMEZAWA Hiroyuki
  2009-03-23  5:04                                       ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23  2:41 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Mon, 23 Mar 2009 10:45:55 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> It might be a nitpick, but I think this patch cannot handle the case:
> 
>             processA                   |           processB
>   -------------------------------------+-------------------------------------
>     (free_swap_and_cache())            |  (read_swap_cache_async())
>                                        |    swap_duplicate()
>       swap_entry_free() == 1           |
>       find_get_page() -> cannot find   |
>                                        |    __set_page_locked()
>                                        |    add_to_swap_cache()
>                                        |    lru_cache_add_anon()
>                                        |      doesn't link this page to memcg's
>                                        |      LRU, because of !PageCgroupUsed.
> 
> And I think we should avoid changing non-memcg code as long as possible,
> so I prefer orphan list approach.
> 
> The cause of an odd behavior of the previous orphan list patch is 
> the race between commit_charge_swapin and lru_add.
> PCG_ORPHAN flags is set by lru_add, so it cannot be seen while the
> page is on pvec(on another cpu). So, lru_del_before_commit_swapcache
> cannot remove the pc from orphan list.
> 
> How about this patch ?
> It(rebased on mmotm) worked well(applied -rc8+some patches) during last night
> on my note-pc(i386, 2CPU, 2GB).
> 
I trust your test and the patch seems to be good.  I have some nitpicks.


> - change cache_charge to use try_charge_swapin() and commit_charge_swapin()
>   when PageSwapCache.
> - keep zone->lru_lock while commit_charge if needed to protect PCG_ORPHAN.
> 
> It only introduces orphan list and doesn't implement the reclaim part.
> 
> I can post a patch only for changing cache_charge if you want.
> 
> 
> Thanks,
> Daisuke Nishimura.
> ===
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/page_cgroup.h |   13 +++
>  mm/memcontrol.c             |  203 ++++++++++++++++++++++++++-----------------
>  2 files changed, 135 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 7339c7b..47ad25c 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
>  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname) \
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> +
> +#define TESTCLEARPCGFLAG(uname, lname) \
> +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTPCGFLAG(Orphan, ORPHAN)
> +TESTSETPCGFLAG(Orphan, ORPHAN)
> +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> +

The TESTCLEAR and TESTSET variants are not necessary in this approach.
SETPCGFLAG() and CLEARPCGFLAG() seem to be enough.
All changes (including commit) are under zone->lru_lock.
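
e.g. a minimal sketch of what I mean, assuming SETPCGFLAG(Orphan, ORPHAN) and
CLEARPCGFLAG(Orphan, ORPHAN) are also declared next to the existing macros:

	/* callers hold zone->lru_lock, so the plain (non test-and-set) accessors are enough */
	static inline void remove_orphan_list(struct page_cgroup *pc)
	{
		if (PageCgroupOrphan(pc)) {
			ClearPageCgroupOrphan(pc);
			list_del_init(&pc->lru);
		}
	}

	static void add_orphan_list(struct page *page, struct page_cgroup *pc)
	{
		if (!PageCgroupOrphan(pc)) {
			struct orphan_list_zone *opl;

			SetPageCgroupOrphan(pc);
			opl = orphan_lru(page_to_nid(page), page_zonenum(page));
			list_add_tail(&pc->lru, &opl->list);
		}
	}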



>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 55dea59..39cf11f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -371,6 +371,50 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
>   * When moving account, the page is not on LRU. It's isolated.
>   */
>  
> +/*
> + * Orphan List is a list for page_cgroup which is not free but not under
> + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> + * there are other corner cases.
> + *
> + * Usually, updates to this list happens when swap cache is readaheaded and
> + * finally used by process.
> + */
> +
> +/* for orphan page_cgroups, updated under zone->lru_lock. */
> +
> +struct orphan_list_node {
> +	struct orphan_list_zone {
> +		struct list_head list;
> +	} zone[MAX_NR_ZONES];
> +};
> +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +
> +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
> +static inline void remove_orphan_list(struct page_cgroup *pc)
> +{
> +	if (TestClearPageCgroupOrphan(pc))
> +		list_del_init(&pc->lru);
> +}
> +
> +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> +{
> +	if (!TestSetPageCgroupOrphan(pc)) {
> +		struct orphan_list_zone *opl;
> +		opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> +		list_add_tail(&pc->lru, &opl->list);
> +	}
> +}
> +
> +
>  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  {
>  	struct page_cgroup *pc;
> @@ -380,6 +424,14 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageCgroupOrphan(pc))) {
> +		remove_orphan_list(pc);
> +		return;
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
> @@ -433,51 +485,17 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> -		return;
> +	if (!PageCgroupUsed(pc)) {
> +		/* handle swap cache here */
> +		add_orphan_list(page, pc);
> + 		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
>  	list_add(&pc->lru, &mz->lists[lru]);
>  }
>  
> -/*
> - * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
> - * lru because the page may.be reused after it's fully uncharged (because of
> - * SwapCache behavior).To handle that, unlink page_cgroup from LRU when charge
> - * it again. This function is only used to charge SwapCache. It's done under
> - * lock_page and expected that zone->lru_lock is never held.
> - */
> -static void mem_cgroup_lru_del_before_commit_swapcache(struct page *page)
> -{
> -	unsigned long flags;
> -	struct zone *zone = page_zone(page);
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	/*
> -	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> -	 * is guarded by lock_page() because the page is SwapCache.
> -	 */
> -	if (!PageCgroupUsed(pc))
> -		mem_cgroup_del_lru_list(page, page_lru(page));
> -	spin_unlock_irqrestore(&zone->lru_lock, flags);
> -}
> -
> -static void mem_cgroup_lru_add_after_commit_swapcache(struct page *page)
> -{
> -	unsigned long flags;
> -	struct zone *zone = page_zone(page);
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	/* link when the page is linked to LRU but page_cgroup isn't */
> -	if (PageLRU(page) && list_empty(&pc->lru))
> -		mem_cgroup_add_lru_list(page, page_lru(page));
> -	spin_unlock_irqrestore(&zone->lru_lock, flags);
> -}
> -
> -
>  void mem_cgroup_move_lists(struct page *page,
>  			   enum lru_list from, enum lru_list to)
>  {
> @@ -784,6 +802,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
>  	return num;
>  }
>  
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_list_node *opl;
> +	int nid, zid;
> +	int size = sizeof(struct orphan_list_node);
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		if (node_state(nid, N_NORMAL_MEMORY))
> +			opl = kmalloc_node(size,  GFP_KERNEL, nid);
> +		else
> +			opl = kmalloc(size, GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			INIT_LIST_HEAD(&opl->zone[zid].list);
> +		orphan_list[nid] = opl;
> +	}
> +}
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1238,6 +1274,10 @@ int mem_cgroup_newpage_charge(struct page *page,
>  				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
>  }
>  
> +static void
> +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> +					enum charge_type ctype);
> +
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask)
>  {
> @@ -1274,16 +1314,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  		unlock_page_cgroup(pc);
>  	}
>  
> -	if (do_swap_account && PageSwapCache(page)) {
> -		mem = try_get_mem_cgroup_from_swapcache(page);
> -		if (mem)
> -			mm = NULL;
> -		  else
> -			mem = NULL;
> -		/* SwapCache may be still linked to LRU now. */
> -		mem_cgroup_lru_del_before_commit_swapcache(page);
> -	}
> -
>  	if (unlikely(!mm && !mem))
>  		mm = &init_mm;
>  
> @@ -1291,32 +1321,16 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  		return mem_cgroup_charge_common(page, mm, gfp_mask,
>  				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
>  
> -	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
> -	if (mem)
> -		css_put(&mem->css);
> -	if (PageSwapCache(page))
> -		mem_cgroup_lru_add_after_commit_swapcache(page);
> +	/* shmem */
> +	if (PageSwapCache(page)) {
> +		ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
> +		if (!ret)
> +			__mem_cgroup_commit_charge_swapin(page, mem,
> +					MEM_CGROUP_CHARGE_TYPE_SHMEM);
> +	} else
> +		ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> +					MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
>  
> -	if (do_swap_account && !ret && PageSwapCache(page)) {
> -		swp_entry_t ent = {.val = page_private(page)};
> -		unsigned short id;
> -		/* avoid double counting */
> -		id = swap_cgroup_record(ent, 0);
> -		rcu_read_lock();
> -		mem = mem_cgroup_lookup(id);
> -		if (mem) {
> -			/*
> -			 * We did swap-in. Then, this entry is doubly counted
> -			 * both in mem and memsw. We uncharge it, here.
> -			 * Recorded ID can be obsolete. We avoid calling
> -			 * css_tryget()
> -			 */
> -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> -			mem_cgroup_put(mem);
> -		}
> -		rcu_read_unlock();
> -	}
>  	return ret;
>  }
>  
Nice clean-up here :)


> @@ -1359,18 +1373,40 @@ charge_cur_mm:
>  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
>  }
>  
> -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> +static void
> +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> +					enum charge_type ctype)
>  {
> -	struct page_cgroup *pc;
> +	unsigned long flags;
> +	struct zone *zone = page_zone(page);
> +	struct page_cgroup *pc = lookup_page_cgroup(page);
> +	int locked = 0;
>  
>  	if (mem_cgroup_disabled())
>  		return;
>  	if (!ptr)
>  		return;
> -	pc = lookup_page_cgroup(page);
> -	mem_cgroup_lru_del_before_commit_swapcache(page);
> -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> -	mem_cgroup_lru_add_after_commit_swapcache(page);
> +
> +	/*
> +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> +	 * is guarded by lock_page() because the page is SwapCache.
> +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> +	 */
> +	if (!PageCgroupUsed(pc)) {
> +		locked = 1;
> +		spin_lock_irqsave(&zone->lru_lock, flags);
> +		mem_cgroup_del_lru_list(page, page_lru(page));
> +	}
Maybe nice. I tried to use lock_page_cgroup() in add_list but I couldn't ;(
I think this works well. But I wonder... why do you have to check PageCgroupUsed()?
And is it correct? Removing the PageCgroupUsed() bit check is nice.
(This will be a "usually returns true" check, anyway.)

> +
> +	__mem_cgroup_commit_charge(ptr, pc, ctype);
> +


> +	if (locked) {
> +		/* link when the page is linked to LRU but page_cgroup isn't */
> +		if (PageLRU(page) && list_empty(&pc->lru))
> +			mem_cgroup_add_lru_list(page, page_lru(page));
> +		spin_unlock_irqrestore(&zone->lru_lock, flags);
> +	}
> +

Regards,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-23  2:41                                     ` KAMEZAWA Hiroyuki
@ 2009-03-23  5:04                                       ` Daisuke Nishimura
  2009-03-23  5:22                                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-23  5:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

> > @@ -40,12 +41,24 @@ static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
> >  static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
> >  	{ clear_bit(PCG_##lname, &pc->flags);  }
> >  
> > +#define TESTSETPCGFLAG(uname, lname) \
> > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
> > +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> > +
> > +#define TESTCLEARPCGFLAG(uname, lname) \
> > +static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
> > +	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
> > +
> >  /* Cache flag is set only once (at allocation) */
> >  TESTPCGFLAG(Cache, CACHE)
> >  
> >  TESTPCGFLAG(Used, USED)
> >  CLEARPCGFLAG(Used, USED)
> >  
> > +TESTPCGFLAG(Orphan, ORPHAN)
> > +TESTSETPCGFLAG(Orphan, ORPHAN)
> > +TESTCLEARPCGFLAG(Orphan, ORPHAN)
> > +
> 
> This TESTCLEAR, TESTSET is not necessary in this approarch.
> SETPCGFLAG() and CLEARPCGFLAG() seems to be enough.
> All changes (including commit) is under zone->lru_lock.
> 
Okay.

> > @@ -1238,6 +1274,10 @@ int mem_cgroup_newpage_charge(struct page *page,
> >  				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> >  }
> >  
> > +static void
> > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > +					enum charge_type ctype);
> > +
> >  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >  				gfp_t gfp_mask)
> >  {
> > @@ -1274,16 +1314,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >  		unlock_page_cgroup(pc);
> >  	}
> >  
> > -	if (do_swap_account && PageSwapCache(page)) {
> > -		mem = try_get_mem_cgroup_from_swapcache(page);
> > -		if (mem)
> > -			mm = NULL;
> > -		  else
> > -			mem = NULL;
> > -		/* SwapCache may be still linked to LRU now. */
> > -		mem_cgroup_lru_del_before_commit_swapcache(page);
> > -	}
> > -
> >  	if (unlikely(!mm && !mem))
> >  		mm = &init_mm;
> >  
> > @@ -1291,32 +1321,16 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >  		return mem_cgroup_charge_common(page, mm, gfp_mask,
> >  				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
> >  
> > -	ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> > -				MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
> > -	if (mem)
> > -		css_put(&mem->css);
> > -	if (PageSwapCache(page))
> > -		mem_cgroup_lru_add_after_commit_swapcache(page);
> > +	/* shmem */
> > +	if (PageSwapCache(page)) {
> > +		ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
> > +		if (!ret)
> > +			__mem_cgroup_commit_charge_swapin(page, mem,
> > +					MEM_CGROUP_CHARGE_TYPE_SHMEM);
> > +	} else
> > +		ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> > +					MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
> >  
> > -	if (do_swap_account && !ret && PageSwapCache(page)) {
> > -		swp_entry_t ent = {.val = page_private(page)};
> > -		unsigned short id;
> > -		/* avoid double counting */
> > -		id = swap_cgroup_record(ent, 0);
> > -		rcu_read_lock();
> > -		mem = mem_cgroup_lookup(id);
> > -		if (mem) {
> > -			/*
> > -			 * We did swap-in. Then, this entry is doubly counted
> > -			 * both in mem and memsw. We uncharge it, here.
> > -			 * Recorded ID can be obsolete. We avoid calling
> > -			 * css_tryget()
> > -			 */
> > -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > -			mem_cgroup_put(mem);
> > -		}
> > -		rcu_read_unlock();
> > -	}
> >  	return ret;
> >  }
> >  
> Nice clean-up here :)
> 
Thanks, I'll send a cleanup patch for this part later.

> > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> >  }
> >  
> > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > +static void
> > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > +					enum charge_type ctype)
> >  {
> > -	struct page_cgroup *pc;
> > +	unsigned long flags;
> > +	struct zone *zone = page_zone(page);
> > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > +	int locked = 0;
> >  
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  	if (!ptr)
> >  		return;
> > -	pc = lookup_page_cgroup(page);
> > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > +
> > +	/*
> > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > +	 * is guarded by lock_page() because the page is SwapCache.
> > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > +	 */
> > +	if (!PageCgroupUsed(pc)) {
> > +		locked = 1;
> > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > +	}
> Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> And is it correct ? Removing PageCgroupUsed() bit check is nice.
> (This will be "usually returns true" check, anyway)
> 
I've just copied lru_del_before_commit_swapcache.

As you say, this check will return false only in (C) case in memcg_test.txt,
and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
would be no problem.

OK, I'll remove this check.

This is the updated version(w/o cache_charge cleanup).

BTW, Should I merge reclaim part based on your patch and post it ?


Thanks,
Daisuke Nishimura.
===
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/page_cgroup.h |    5 ++
 mm/memcontrol.c             |  137 +++++++++++++++++++++++++++++--------------
 2 files changed, 97 insertions(+), 45 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..e65e61e 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -46,6 +47,10 @@ TESTPCGFLAG(Cache, CACHE)
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+TESTPCGFLAG(Orphan, ORPHAN)
+SETPCGFLAG(Orphan, ORPHAN)
+CLEARPCGFLAG(Orphan, ORPHAN)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2fc6d6c..3492286 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -371,6 +371,50 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
  * When moving account, the page is not on LRU. It's isolated.
  */
 
+/*
+ * Orphan List is a list for page_cgroup which is not free but not under
+ * any cgroup. SwapCache which is prefetched by readahead() is typical type but
+ * there are other corner cases.
+ *
+ * Usually, updates to this list happens when swap cache is readaheaded and
+ * finally used by process.
+ */
+
+/* for orphan page_cgroups, updated under zone->lru_lock. */
+
+struct orphan_list_node {
+	struct orphan_list_zone {
+		struct list_head list;
+	} zone[MAX_NR_ZONES];
+};
+struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+
+static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
+{
+	/*
+	 * 2 cases for this BUG_ON(), swapcache is generated while init.
+	 * or NID should be invalid.
+	 */
+	BUG_ON(!orphan_list[nid]);
+	return  &orphan_list[nid]->zone[zid];
+}
+
+static inline void remove_orphan_list(struct page_cgroup *pc)
+{
+	ClearPageCgroupOrphan(pc);
+	list_del_init(&pc->lru);
+}
+
+static void add_orphan_list(struct page *page, struct page_cgroup *pc)
+{
+	struct orphan_list_zone *opl;
+
+	SetPageCgroupOrphan(pc);
+	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+	list_add_tail(&pc->lru, &opl->list);
+}
+
+
 void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
@@ -380,6 +424,14 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
+	/*
+	 * If the page is SwapCache and already on global LRU, it will be on
+	 * orphan list. remove here
+	 */
+	if (unlikely(PageCgroupOrphan(pc))) {
+		remove_orphan_list(pc);
+		return;
+	}
 	/* can happen while we handle swapcache. */
 	if (list_empty(&pc->lru) || !pc->mem_cgroup)
 		return;
@@ -433,51 +485,17 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
 	 */
 	smp_rmb();
-	if (!PageCgroupUsed(pc))
-		return;
+	if (!PageCgroupUsed(pc) && !PageCgroupOrphan(pc)) {
+		/* handle swap cache here */
+		add_orphan_list(page, pc);
+ 		return;
+	}
 
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
 	list_add(&pc->lru, &mz->lists[lru]);
 }
 
-/*
- * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
- * lru because the page may.be reused after it's fully uncharged (because of
- * SwapCache behavior).To handle that, unlink page_cgroup from LRU when charge
- * it again. This function is only used to charge SwapCache. It's done under
- * lock_page and expected that zone->lru_lock is never held.
- */
-static void mem_cgroup_lru_del_before_commit_swapcache(struct page *page)
-{
-	unsigned long flags;
-	struct zone *zone = page_zone(page);
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	/*
-	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
-	 * is guarded by lock_page() because the page is SwapCache.
-	 */
-	if (!PageCgroupUsed(pc))
-		mem_cgroup_del_lru_list(page, page_lru(page));
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-}
-
-static void mem_cgroup_lru_add_after_commit_swapcache(struct page *page)
-{
-	unsigned long flags;
-	struct zone *zone = page_zone(page);
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	/* link when the page is linked to LRU but page_cgroup isn't */
-	if (PageLRU(page) && list_empty(&pc->lru))
-		mem_cgroup_add_lru_list(page, page_lru(page));
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-}
-
-
 void mem_cgroup_move_lists(struct page *page,
 			   enum lru_list from, enum lru_list to)
 {
@@ -784,6 +802,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 	return num;
 }
 
+static __init void init_orphan_lru(void)
+{
+	struct orphan_list_node *opl;
+	int nid, zid;
+	int size = sizeof(struct orphan_list_node);
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		if (node_state(nid, N_NORMAL_MEMORY))
+			opl = kmalloc_node(size,  GFP_KERNEL, nid);
+		else
+			opl = kmalloc(size, GFP_KERNEL);
+		BUG_ON(!opl);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			INIT_LIST_HEAD(&opl->zone[zid].list);
+		orphan_list[nid] = opl;
+	}
+}
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -1341,16 +1377,28 @@ static void
 __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
 					enum charge_type ctype)
 {
-	struct page_cgroup *pc;
+	unsigned long flags;
+	struct zone *zone = page_zone(page);
+	struct page_cgroup *pc = lookup_page_cgroup(page);
 
 	if (mem_cgroup_disabled())
 		return;
 	if (!ptr)
 		return;
-	pc = lookup_page_cgroup(page);
-	mem_cgroup_lru_del_before_commit_swapcache(page);
+
+	/* If this pc is on orphan LRU, it is removed from orphan list here. */
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	mem_cgroup_del_lru_list(page, page_lru(page));
+
+	/* We should hold zone->lru_lock to protect PCG_ORPHAN. */
+	VM_BUG_ON(PageCgroupOrphan(pc));
 	__mem_cgroup_commit_charge(ptr, pc, ctype);
-	mem_cgroup_lru_add_after_commit_swapcache(page);
+
+	/* link when the page is linked to LRU but page_cgroup isn't */
+	if (PageLRU(page) && list_empty(&pc->lru))
+		mem_cgroup_add_lru_list(page, page_lru(page));
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
 	/*
 	 * Now swap is on-memory. This means this page may be
 	 * counted both as mem and swap....double count.
@@ -1376,8 +1424,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
-
 }
 
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
@@ -2438,6 +2484,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	/* root ? */
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
+		init_orphan_lru();
 		parent = NULL;
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-23  5:04                                       ` Daisuke Nishimura
@ 2009-03-23  5:22                                         ` KAMEZAWA Hiroyuki
  2009-03-24  8:32                                           ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-23  5:22 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Mon, 23 Mar 2009 14:04:19 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > Nice clean-up here :)
> > 
> Thanks, I'll send a cleanup patch for this part later.
> 
Thank you, I'll look into.

> > > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> > >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > >  }
> > >  
> > > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > +static void
> > > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > > +					enum charge_type ctype)
> > >  {
> > > -	struct page_cgroup *pc;
> > > +	unsigned long flags;
> > > +	struct zone *zone = page_zone(page);
> > > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > > +	int locked = 0;
> > >  
> > >  	if (mem_cgroup_disabled())
> > >  		return;
> > >  	if (!ptr)
> > >  		return;
> > > -	pc = lookup_page_cgroup(page);
> > > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > > +
> > > +	/*
> > > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > > +	 * is guarded by lock_page() because the page is SwapCache.
> > > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > > +	 */
> > > +	if (!PageCgroupUsed(pc)) {
> > > +		locked = 1;
> > > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > > +	}
> > Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> > I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> > And is it correct ? Removing PageCgroupUsed() bit check is nice.
> > (This will be "usually returns true" check, anyway)
> > 
> I've just copied lru_del_before_commit_swapcache.
> 
ya, considering now, it seems to be silly quick-hack.

> As you say, this check will return false only in (C) case in memcg_test.txt,
> and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
> would be no problem.
> 
> OK, I'll remove this check.
> 
Thanks,

> This is the updated version(w/o cache_charge cleanup).
> 
> BTW, Should I merge reclaim part based on your patch and post it ?
> 
I think not necessary. keeping changes minimum is important as BUGFIX.
We can visit here again when new -RC stage starts.

no problem from my review.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thank you very much!
-Kame


> 
> Thanks,
> Daisuke Nishimura.
> ===
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/page_cgroup.h |    5 ++
>  mm/memcontrol.c             |  137 +++++++++++++++++++++++++++++--------------
>  2 files changed, 97 insertions(+), 45 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 7339c7b..e65e61e 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
>  	PCG_LOCK,  /* page cgroup is locked */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> +	PCG_ORPHAN, /* this is not used from memcg:s view but on global LRU */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -46,6 +47,10 @@ TESTPCGFLAG(Cache, CACHE)
>  TESTPCGFLAG(Used, USED)
>  CLEARPCGFLAG(Used, USED)
>  
> +TESTPCGFLAG(Orphan, ORPHAN)
> +SETPCGFLAG(Orphan, ORPHAN)
> +CLEARPCGFLAG(Orphan, ORPHAN)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2fc6d6c..3492286 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -371,6 +371,50 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
>   * When moving account, the page is not on LRU. It's isolated.
>   */
>  
> +/*
> + * Orphan List is a list for page_cgroup which is not free but not under
> + * any cgroup. SwapCache which is prefetched by readahead() is typical type but
> + * there are other corner cases.
> + *
> + * Usually, updates to this list happens when swap cache is readaheaded and
> + * finally used by process.
> + */
> +
> +/* for orphan page_cgroups, updated under zone->lru_lock. */
> +
> +struct orphan_list_node {
> +	struct orphan_list_zone {
> +		struct list_head list;
> +	} zone[MAX_NR_ZONES];
> +};
> +struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +
> +static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> +{
> +	/*
> +	 * 2 cases for this BUG_ON(), swapcache is generated while init.
> +	 * or NID should be invalid.
> +	 */
> +	BUG_ON(!orphan_list[nid]);
> +	return  &orphan_list[nid]->zone[zid];
> +}
> +
> +static inline void remove_orphan_list(struct page_cgroup *pc)
> +{
> +	ClearPageCgroupOrphan(pc);
> +	list_del_init(&pc->lru);
> +}
> +
> +static void add_orphan_list(struct page *page, struct page_cgroup *pc)
> +{
> +	struct orphan_list_zone *opl;
> +
> +	SetPageCgroupOrphan(pc);
> +	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> +	list_add_tail(&pc->lru, &opl->list);
> +}
> +
> +
>  void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  {
>  	struct page_cgroup *pc;
> @@ -380,6 +424,14 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  	if (mem_cgroup_disabled())
>  		return;
>  	pc = lookup_page_cgroup(page);
> +	/*
> +	 * If the page is SwapCache and already on global LRU, it will be on
> +	 * orphan list. remove here
> +	 */
> +	if (unlikely(PageCgroupOrphan(pc))) {
> +		remove_orphan_list(pc);
> +		return;
> +	}
>  	/* can happen while we handle swapcache. */
>  	if (list_empty(&pc->lru) || !pc->mem_cgroup)
>  		return;
> @@ -433,51 +485,17 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
>  	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
>  	 */
>  	smp_rmb();
> -	if (!PageCgroupUsed(pc))
> -		return;
> +	if (!PageCgroupUsed(pc) && !PageCgroupOrphan(pc)) {
> +		/* handle swap cache here */
> +		add_orphan_list(page, pc);
> + 		return;
> +	}
>  
>  	mz = page_cgroup_zoneinfo(pc);
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1;
>  	list_add(&pc->lru, &mz->lists[lru]);
>  }
>  
> -/*
> - * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
> - * lru because the page may.be reused after it's fully uncharged (because of
> - * SwapCache behavior).To handle that, unlink page_cgroup from LRU when charge
> - * it again. This function is only used to charge SwapCache. It's done under
> - * lock_page and expected that zone->lru_lock is never held.
> - */
> -static void mem_cgroup_lru_del_before_commit_swapcache(struct page *page)
> -{
> -	unsigned long flags;
> -	struct zone *zone = page_zone(page);
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	/*
> -	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> -	 * is guarded by lock_page() because the page is SwapCache.
> -	 */
> -	if (!PageCgroupUsed(pc))
> -		mem_cgroup_del_lru_list(page, page_lru(page));
> -	spin_unlock_irqrestore(&zone->lru_lock, flags);
> -}
> -
> -static void mem_cgroup_lru_add_after_commit_swapcache(struct page *page)
> -{
> -	unsigned long flags;
> -	struct zone *zone = page_zone(page);
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	/* link when the page is linked to LRU but page_cgroup isn't */
> -	if (PageLRU(page) && list_empty(&pc->lru))
> -		mem_cgroup_add_lru_list(page, page_lru(page));
> -	spin_unlock_irqrestore(&zone->lru_lock, flags);
> -}
> -
> -
>  void mem_cgroup_move_lists(struct page *page,
>  			   enum lru_list from, enum lru_list to)
>  {
> @@ -784,6 +802,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
>  	return num;
>  }
>  
> +static __init void init_orphan_lru(void)
> +{
> +	struct orphan_list_node *opl;
> +	int nid, zid;
> +	int size = sizeof(struct orphan_list_node);
> +
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		if (node_state(nid, N_NORMAL_MEMORY))
> +			opl = kmalloc_node(size,  GFP_KERNEL, nid);
> +		else
> +			opl = kmalloc(size, GFP_KERNEL);
> +		BUG_ON(!opl);
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			INIT_LIST_HEAD(&opl->zone[zid].list);
> +		orphan_list[nid] = opl;
> +	}
> +}
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1341,16 +1377,28 @@ static void
>  __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
>  					enum charge_type ctype)
>  {
> -	struct page_cgroup *pc;
> +	unsigned long flags;
> +	struct zone *zone = page_zone(page);
> +	struct page_cgroup *pc = lookup_page_cgroup(page);
>  
>  	if (mem_cgroup_disabled())
>  		return;
>  	if (!ptr)
>  		return;
> -	pc = lookup_page_cgroup(page);
> -	mem_cgroup_lru_del_before_commit_swapcache(page);
> +
> +	/* If this pc is on orphan LRU, it is removed from orphan list here. */
> +	spin_lock_irqsave(&zone->lru_lock, flags);
> +	mem_cgroup_del_lru_list(page, page_lru(page));
> +
> +	/* We should hold zone->lru_lock to protect PCG_ORPHAN. */
> +	VM_BUG_ON(PageCgroupOrphan(pc));
>  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> -	mem_cgroup_lru_add_after_commit_swapcache(page);
> +
> +	/* link when the page is linked to LRU but page_cgroup isn't */
> +	if (PageLRU(page) && list_empty(&pc->lru))
> +		mem_cgroup_add_lru_list(page, page_lru(page));
> +	spin_unlock_irqrestore(&zone->lru_lock, flags);
> +
>  	/*
>  	 * Now swap is on-memory. This means this page may be
>  	 * counted both as mem and swap....double count.
> @@ -1376,8 +1424,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
>  		}
>  		rcu_read_unlock();
>  	}
> -	/* add this page(page_cgroup) to the LRU we want. */
> -
>  }
>  
>  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> @@ -2438,6 +2484,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	/* root ? */
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
> +		init_orphan_lru();
>  		parent = NULL;
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-23  5:22                                         ` KAMEZAWA Hiroyuki
@ 2009-03-24  8:32                                           ` Daisuke Nishimura
  2009-03-24 23:57                                             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-03-24  8:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Mon, 23 Mar 2009 14:22:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 23 Mar 2009 14:04:19 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > > Nice clean-up here :)
> > > 
> > Thanks, I'll send a cleanup patch for this part later.
> > 
> Thank you, I'll look into.
> 
> > > > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> > > >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > > >  }
> > > >  
> > > > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > > +static void
> > > > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > > > +					enum charge_type ctype)
> > > >  {
> > > > -	struct page_cgroup *pc;
> > > > +	unsigned long flags;
> > > > +	struct zone *zone = page_zone(page);
> > > > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > > > +	int locked = 0;
> > > >  
> > > >  	if (mem_cgroup_disabled())
> > > >  		return;
> > > >  	if (!ptr)
> > > >  		return;
> > > > -	pc = lookup_page_cgroup(page);
> > > > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > > > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > > > +
> > > > +	/*
> > > > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > > > +	 * is guarded by lock_page() because the page is SwapCache.
> > > > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > > > +	 */
> > > > +	if (!PageCgroupUsed(pc)) {
> > > > +		locked = 1;
> > > > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > > > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > > > +	}
> > > Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> > > I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> > > And is it correct ? Removing PageCgroupUsed() bit check is nice.
> > > (This will be "usually returns true" check, anyway)
> > > 
> > I've just copied lru_del_before_commit_swapcache.
> > 
> ya, considering now, it seems to be silly quick-hack.
> 
> > As you say, this check will return false only in (C) case in memcg_test.txt,
> > and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
> > would be no problem.
> > 
> > OK, I'll remove this check.
> > 
> Thanks,
> 
> > This is the updated version(w/o cache_charge cleanup).
> > 
> > BTW, Should I merge reclaim part based on your patch and post it ?
> > 
> I think not necessary. keeping changes minimum is important as BUGFIX.
> We can visit here again when new -RC stage starts.
> 
> no problem from my review.
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
Just FYI, this version of orphan list framework works fine
w/o causing BUG more than 24h.

So, I believe we can implement reclaim part based on this
to fix the original problem.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-24  8:32                                           ` Daisuke Nishimura
@ 2009-03-24 23:57                                             ` KAMEZAWA Hiroyuki
  2009-04-17  6:34                                               ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-24 23:57 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 24 Mar 2009 17:32:18 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Mon, 23 Mar 2009 14:22:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 23 Mar 2009 14:04:19 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > > Nice clean-up here :)
> > > > 
> > > Thanks, I'll send a cleanup patch for this part later.
> > > 
> > Thank you, I'll look into.
> > 
> > > > > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> > > > >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > > > >  }
> > > > >  
> > > > > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > > > +static void
> > > > > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > > > > +					enum charge_type ctype)
> > > > >  {
> > > > > -	struct page_cgroup *pc;
> > > > > +	unsigned long flags;
> > > > > +	struct zone *zone = page_zone(page);
> > > > > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > > > > +	int locked = 0;
> > > > >  
> > > > >  	if (mem_cgroup_disabled())
> > > > >  		return;
> > > > >  	if (!ptr)
> > > > >  		return;
> > > > > -	pc = lookup_page_cgroup(page);
> > > > > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > > > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > > > > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > > > > +
> > > > > +	/*
> > > > > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > > > > +	 * is guarded by lock_page() because the page is SwapCache.
> > > > > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > > > > +	 */
> > > > > +	if (!PageCgroupUsed(pc)) {
> > > > > +		locked = 1;
> > > > > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > > > > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > > > > +	}
> > > > Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> > > > I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> > > > And is it correct ? Removing PageCgroupUsed() bit check is nice.
> > > > (This will be "usually returns true" check, anyway)
> > > > 
> > > I've just copied lru_del_before_commit_swapcache.
> > > 
> > ya, considering now, it seems to be silly quick-hack.
> > 
> > > As you say, this check will return false only in (C) case in memcg_test.txt,
> > > and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
> > > would be no problem.
> > > 
> > > OK, I'll remove this check.
> > > 
> > Thanks,
> > 
> > > This is the updated version(w/o cache_charge cleanup).
> > > 
> > > BTW, Should I merge reclaim part based on your patch and post it ?
> > > 
> > I think not necessary. keeping changes minimum is important as BUGFIX.
> > We can visit here again when new -RC stage starts.
> > 
> > no problem from my review.
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> Just FYI, this version of orphan list framework works fine
> w/o causing BUG more than 24h.
> 
> So, I believe we can implement reclaim part based on this
> to fix the original problem.
> 
ok, but I'd like to wait to start it until the end of merge-window.


Thanks,
-Kame

> 
> Thanks,
> Daisuke Nishimura.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <dont@kvack.org>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-03-24 23:57                                             ` KAMEZAWA Hiroyuki
@ 2009-04-17  6:34                                               ` Daisuke Nishimura
  2009-04-17  6:54                                                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-17  6:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Wed, 25 Mar 2009 08:57:13 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 24 Mar 2009 17:32:18 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Mon, 23 Mar 2009 14:22:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 23 Mar 2009 14:04:19 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > 
> > > > > Nice clean-up here :)
> > > > > 
> > > > Thanks, I'll send a cleanup patch for this part later.
> > > > 
> > > Thank you, I'll look into.
> > > 
> > > > > > @@ -1359,18 +1373,40 @@ charge_cur_mm:
> > > > > >  	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > > > > >  }
> > > > > >  
> > > > > > -void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > > > > +static void
> > > > > > +__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
> > > > > > +					enum charge_type ctype)
> > > > > >  {
> > > > > > -	struct page_cgroup *pc;
> > > > > > +	unsigned long flags;
> > > > > > +	struct zone *zone = page_zone(page);
> > > > > > +	struct page_cgroup *pc = lookup_page_cgroup(page);
> > > > > > +	int locked = 0;
> > > > > >  
> > > > > >  	if (mem_cgroup_disabled())
> > > > > >  		return;
> > > > > >  	if (!ptr)
> > > > > >  		return;
> > > > > > -	pc = lookup_page_cgroup(page);
> > > > > > -	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > > > > -	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > > > > > -	mem_cgroup_lru_add_after_commit_swapcache(page);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
> > > > > > +	 * is guarded by lock_page() because the page is SwapCache.
> > > > > > +	 * If this pc is on orphan LRU, it is also removed from orphan LRU here.
> > > > > > +	 */
> > > > > > +	if (!PageCgroupUsed(pc)) {
> > > > > > +		locked = 1;
> > > > > > +		spin_lock_irqsave(&zone->lru_lock, flags);
> > > > > > +		mem_cgroup_del_lru_list(page, page_lru(page));
> > > > > > +	}
> > > > > Maybe nice. I tried to use lock_page_cgroup() in add_list but I can't ;(
> > > > > I think this works well. But I wonder...why you have to check PageCgroupUsed() ?
> > > > > And is it correct ? Removing PageCgroupUsed() bit check is nice.
> > > > > (This will be "usually returns true" check, anyway)
> > > > > 
> > > > I've just copied lru_del_before_commit_swapcache.
> > > > 
> > > ya, considering now, it seems to be silly quick-hack.
> > > 
> > > > As you say, this check will return false only in (C) case in memcg_test.txt,
> > > > and even in (C) case calling mem_cgroup_del_lru_list(and mem_cgroup_add_lru_list later)
> > > > would be no problem.
> > > > 
> > > > OK, I'll remove this check.
> > > > 
> > > Thanks,
> > > 
> > > > This is the updated version(w/o cache_charge cleanup).
> > > > 
> > > > BTW, Should I merge reclaim part based on your patch and post it ?
> > > > 
> > > I think not necessary. keeping changes minimum is important as BUGFIX.
> > > We can visit here again when new -RC stage starts.
> > > 
> > > no problem from my review.
> > > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > Just FYI, this version of orphan list framework works fine
> > w/o causing BUG more than 24h.
> > 
> > So, I believe we can implement reclaim part based on this
> > to fix the original problem.
> > 
> ok, but I'd like to wait to start it until the end of merge-window.
> 
I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
and have been testing it these days.

Major changes from your version:
- count the number of orphan pages per zone and make the threshold per zone(4MB).
- As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
  But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
  so add a check and try_to_free_swap to the end of shrink_page_list.

It seems work fine, no "pseud leak" of SwapCache can be seen.

What do you think ?
If it's all right, I'll merge this with the orphan list framework patch
and send it to Andrew with other fixes of memcg that I have.

Thanks,
Daisuke Nishimura.
===

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/swap.h |    6 +++
 mm/memcontrol.c      |  119 +++++++++++++++++++++++++++++++++++++++++++++++---
 mm/swapfile.c        |   23 ++++++++++
 mm/vmscan.c          |   20 ++++++++
 4 files changed, 162 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62d8143..02baae1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -311,6 +311,7 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
+extern int try_to_drop_swapcache(struct page *);
 struct backing_dev_info;
 
 /* linux/mm/thrash.c */
@@ -418,6 +419,11 @@ static inline int try_to_free_swap(struct page *page)
 	return 0;
 }
 
+static inline int try_to_drop_swapcache(struct page *page)
+{
+	return 0;
+}
+
 static inline swp_entry_t get_swap_page(void)
 {
 	swp_entry_t entry;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 259b09e..8638c7b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -384,10 +384,14 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
 
 struct orphan_list_node {
 	struct orphan_list_zone {
+		unsigned long count;
 		struct list_head list;
 	} zone[MAX_NR_ZONES];
 };
 struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
+#define ORPHAN_THRESH (1024)	/* 4MB per zone */
+static void try_scan_orphan_list(int, int);
+static int memory_cgroup_is_used __read_mostly;
 
 static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
 {
@@ -399,19 +403,29 @@ static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
 	return  &orphan_list[nid]->zone[zid];
 }
 
-static inline void remove_orphan_list(struct page_cgroup *pc)
+static inline void remove_orphan_list(struct page *page, struct page_cgroup *pc)
 {
+	struct orphan_list_zone *opl;
+
 	ClearPageCgroupOrphan(pc);
+	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
 	list_del_init(&pc->lru);
+	opl->count--;
 }
 
 static inline void add_orphan_list(struct page *page, struct page_cgroup *pc)
 {
+	int nid = page_to_nid(page);
+	int zid = page_zonenum(page);
 	struct orphan_list_zone *opl;
 
 	SetPageCgroupOrphan(pc);
-	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
+	opl = orphan_lru(nid, zid);
 	list_add_tail(&pc->lru, &opl->list);
+	if (unlikely(opl->count++ > ORPHAN_THRESH))
+		/* Orphan is not problem if no mem_cgroup is used */
+		if (memory_cgroup_is_used)
+			try_scan_orphan_list(nid, zid);
 }
 
 
@@ -429,7 +443,7 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 	 * orphan list. remove here
 	 */
 	if (unlikely(PageCgroupOrphan(pc))) {
-		remove_orphan_list(pc);
+		remove_orphan_list(page, pc);
 		return;
 	}
 	/* can happen while we handle swapcache. */
@@ -802,6 +816,89 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 	return num;
 }
 
+
+/*
+ * Usually, *unused* swap caches are reclaimed by the global LRU. But if no one
+ * kicks the global LRU, they will not be reclaimed. With memcg, this is a problem.
+ */
+static int drain_orphan_swapcaches(int nid, int zid)
+{
+	struct page_cgroup *pc;
+	struct zone *zone;
+	struct page *page;
+	struct orphan_list_zone *opl = orphan_lru(nid, zid);
+	unsigned long flags;
+	int drain = 0;
+	int scan;
+
+	zone = &NODE_DATA(nid)->node_zones[zid];
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	scan = opl->count/5;
+	while (!list_empty(&opl->list) && (scan > 0)) {
+		pc = list_entry(opl->list.next, struct page_cgroup, lru);
+		page = pc->page;
+		/* Rotate */
+		list_del(&pc->lru);
+		list_add_tail(&pc->lru, &opl->list);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		scan--;
+		/* Remove from LRU */
+		if (!isolate_lru_page(page)) { /* get_page is called */
+			if (!page_mapped(page) && trylock_page(page)) {
+				/* This does all necessary jobs */
+				drain += try_to_drop_swapcache(page);
+				unlock_page(page);
+			}
+			putback_lru_page(page); /* put_page is called */
+		}
+		spin_lock_irqsave(&zone->lru_lock, flags);
+	}
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+	return drain;
+}
+
+static void try_delete_orphan_caches_all(void)
+{
+	int nid, zid;
+
+	for_each_node_state(nid, N_HIGH_MEMORY)
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			drain_orphan_swapcaches(nid, zid);
+}
+
+/* Only one worker can scan orphan lists at the same time. */
+static atomic_t orphan_scan_worker;
+struct orphan_scan {
+	int node;
+	int zone;
+	struct work_struct work;
+};
+static struct orphan_scan orphan_scan;
+
+static void try_delete_orphan_caches(struct work_struct *work)
+{
+	int nid, zid;
+
+	nid = orphan_scan.node;
+	zid = orphan_scan.zone;
+	drain_orphan_swapcaches(nid, zid);
+	atomic_dec(&orphan_scan_worker);
+}
+
+static void try_scan_orphan_list(int nid, int zid)
+{
+	if (atomic_inc_return(&orphan_scan_worker) > 1) {
+		atomic_dec(&orphan_scan_worker);
+		return;
+	}
+	orphan_scan.node = nid;
+	orphan_scan.zone = zid;
+	INIT_WORK(&orphan_scan.work, try_delete_orphan_caches);
+	schedule_work(&orphan_scan.work);
+	/* call back function decrements orphan_scan_worker */
+}
+
 static __init void init_orphan_lru(void)
 {
 	struct orphan_list_node *opl;
@@ -814,8 +911,10 @@ static __init void init_orphan_lru(void)
 		else
 			opl = kmalloc(size, GFP_KERNEL);
 		BUG_ON(!opl);
-		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 			INIT_LIST_HEAD(&opl->zone[zid].list);
+			opl->zone[zid].count = 0;
+		}
 		orphan_list[nid] = opl;
 	}
 }
@@ -1000,6 +1099,10 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		if (ret)
 			continue;
 
+		/* unused SwapCache might pressure the memsw usage */
+		if (nr_retries < MEM_CGROUP_RECLAIM_RETRIES/2 && noswap)
+			try_delete_orphan_caches_all();
+
 		/*
 		 * try_to_free_mem_cgroup_pages() might not give us a full
 		 * picture of reclaim. Some pages are reclaimed and might be
@@ -1787,9 +1890,12 @@ int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true, true);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
-		if (curusage >= oldusage)
+		if (curusage >= oldusage) {
 			retry_count--;
-		else
+			/* unused SwapCache might pressure the memsw usage */
+			if (retry_count < MEM_CGROUP_RECLAIM_RETRIES/2)
+				try_delete_orphan_caches_all();
+		} else
 			oldusage = curusage;
 	}
 	return ret;
@@ -2483,6 +2589,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
+		memory_cgroup_is_used = 1;
 	}
 
 	if (parent && parent->use_hierarchy) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 312fafe..9416196 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -571,6 +571,29 @@ int try_to_free_swap(struct page *page)
 }
 
 /*
+ * Similar to try_to_free_swap() but this drops SwapCache without checking
+ * page_swapcount(). By this, this function removes not only an unused swap entry
+ * but also a swap cache which is in memory but never used.
+ * The caller should have a reference to this page and it must be locked.
+ */
+int try_to_drop_swapcache(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+
+	if (!PageSwapCache(page))
+		return 0;
+	if (PageWriteback(page))
+		return 0;
+	if (page_mapped(page))
+		return 0;
+	/*
+	 * remove_mapping() will succeed only when there is no extra
+	 * user of the swap cache. (This keeps us safe against speculative lookup.)
+	 */
+	return remove_mapping(&swapper_space, page);
+}
+
+/*
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
  */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 38d7506..b123eca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -38,6 +38,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
 
@@ -785,6 +786,25 @@ activate_locked:
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
+		if (!scanning_global_lru(sc) && PageSwapCache(page)) {
+			struct page_cgroup *pc;
+
+			pc = lookup_page_cgroup(page);
+			/*
+			 * Used bit of swapcache is solid under page lock.
+			 */
+			if (unlikely(!PageCgroupUsed(pc)))
+				/*
+				 * This can happen if the page is unmapped by
+				 * the owner process before it is added to
+				 * swapcache.
+				 * These swapcaches are usually set dirty by
+				 * add_to_swap, but try_to_drop_swapcache can't
+				 * free dirty swapcaches.
+				 * So free these swapcaches here.
+				 */
+				try_to_free_swap(page);
+		}
 		unlock_page(page);
 keep:
 		list_add(&page->lru, &ret_pages);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  6:34                                               ` Daisuke Nishimura
@ 2009-04-17  6:54                                                 ` KAMEZAWA Hiroyuki
  2009-04-17  7:50                                                   ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-17  6:54 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 15:34:55 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> and have been testing it these days.
> 
Good trial! 
Honestly, I've written a patch to fix this problem in these days but seems to
be over-kill ;)


> Major changes from your version:
> - count the number of orphan pages per zone and make the threshold per zone(4MB).
> - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
>   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
>   so add a check and try_to_free_swap to the end of shrink_page_list.
> 
> It seems work fine, no "pseud leak" of SwapCache can be seen.
> 
> What do you think ?
> If it's all right, I'll merge this with the orphan list framework patch
> and send it to Andrew with other fixes of memcg that I have.
> 
I'm sorry but my answer is "please wait". The reason is..

1. When global LRU works, the pages will be reclaimed.
2. Global LRU will work finally.
3. While testing, "stale" swap cache cannot be big amount.

But, after "soft limit", the situaion will change.
1. Even when global LRU works, page selection is influenced by memcg.
2. So, when we implement soft-limit, we need to handle swap-cache.

Your patch will be necessary finally in near future. But, now, it just
adds code and cannot be very much help, I think.

So, my answer is "please wait"


> Thanks,
> Daisuke Nishimura.
> ===
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/swap.h |    6 +++
>  mm/memcontrol.c      |  119 +++++++++++++++++++++++++++++++++++++++++++++++---
>  mm/swapfile.c        |   23 ++++++++++
>  mm/vmscan.c          |   20 ++++++++
>  4 files changed, 162 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 62d8143..02baae1 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -311,6 +311,7 @@ extern sector_t swapdev_block(int, pgoff_t);
>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
>  extern int reuse_swap_page(struct page *);
>  extern int try_to_free_swap(struct page *);
> +extern int try_to_drop_swapcache(struct page *);
>  struct backing_dev_info;
>  
>  /* linux/mm/thrash.c */
> @@ -418,6 +419,11 @@ static inline int try_to_free_swap(struct page *page)
>  	return 0;
>  }
>  
> +static inline int try_to_drop_swapcache(struct page *page)
> +{
> +	return 0;
> +}
> +
>  static inline swp_entry_t get_swap_page(void)
>  {
>  	swp_entry_t entry;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 259b09e..8638c7b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -384,10 +384,14 @@ static int mem_cgroup_walk_tree(struct mem_cgroup *root, void *data,
>  
>  struct orphan_list_node {
>  	struct orphan_list_zone {
> +		unsigned long count;
>  		struct list_head list;
>  	} zone[MAX_NR_ZONES];
>  };
>  struct orphan_list_node *orphan_list[MAX_NUMNODES] __read_mostly;
> +#define ORPHAN_THRESH (1024)	/* 4MB per zone */
> +static void try_scan_orphan_list(int, int);
> +static int memory_cgroup_is_used __read_mostly;
>  
>  static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
>  {
> @@ -399,19 +403,29 @@ static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
>  	return  &orphan_list[nid]->zone[zid];
>  }
>  
> -static inline void remove_orphan_list(struct page_cgroup *pc)
> +static inline void remove_orphan_list(struct page *page, struct page_cgroup *pc)
>  {
> +	struct orphan_list_zone *opl;
> +
>  	ClearPageCgroupOrphan(pc);

I wonder whether lock_page_cgroup() is necessary here or not..


> +	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
>  	list_del_init(&pc->lru);
> +	opl->count--;
>  }
>  
>  static inline void add_orphan_list(struct page *page, struct page_cgroup *pc)
>  {
> +	int nid = page_to_nid(page);
> +	int zid = page_zonenum(page);
>  	struct orphan_list_zone *opl;
>  
>  	SetPageCgroupOrphan(pc);

here too.

I'm sorry, please give me time. I'd like to post a new version of the soft-limit patches
next week. I'm sorry for the delay in my work.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  6:54                                                 ` KAMEZAWA Hiroyuki
@ 2009-04-17  7:50                                                   ` Daisuke Nishimura
  2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
  2009-04-17  8:11                                                     ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-17  7:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 17 Apr 2009 15:34:55 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > and have been testing it these days.
> > 
> Good trial! 
> Honestly, I've written a patch to fix this problem in these days but seems to
> be over-kill ;)
> 
> 
> > Major changes from your version:
> > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > 
> > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > 
> > What do you think ?
> > If it's all right, I'll merge this with the orphan list framework patch
> > and send it to Andrew with other fixes of memcg that I have.
> > 
> I'm sorry but my answer is "please wait". The reason is..
> 
> 1. When global LRU works, the pages will be reclaimed.
> 2. Global LRU will work finally.
> 3. While testing, "stale" swap cache cannot be big amount.
> 
Hmm, I can't understand 2.

If (memsize on system) >> (swapsize on system), global LRU doesn't run
and all the swap space can be used up by these SwapCache.
This means setting mem.limit can use up all the swap space on the system.
I've tested with 50MB size of swap and it can be used up in less than 24h.
I think it's not small.

> But, after "soft limit", the situaion will change.
> 1. Even when global LRU works, page selection is influenced by memcg.
> 2. So, when we implement soft-limit, we need to handle swap-cache.
> 
> Your patch will be necessary finally in near future. But, now, it just
> adds code and cannot be very much help, I think.
> 
> So, my answer is "please wait"
> 
> 
> > @@ -399,19 +403,29 @@ static inline struct orphan_list_zone *orphan_lru(int nid, int zid)
> >  	return  &orphan_list[nid]->zone[zid];
> >  }
> >  
> > -static inline void remove_orphan_list(struct page_cgroup *pc)
> > +static inline void remove_orphan_list(struct page *page, struct page_cgroup *pc)
> >  {
> > +	struct orphan_list_zone *opl;
> > +
> >  	ClearPageCgroupOrphan(pc);
> 
> I wonder whether lock_page_cgroup() is necessary here or not..
> 
> 
> > +	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> >  	list_del_init(&pc->lru);
> > +	opl->count--;
> >  }
> >  
> >  static inline void add_orphan_list(struct page *page, struct page_cgroup *pc)
> >  {
> > +	int nid = page_to_nid(page);
> > +	int zid = page_zonenum(page);
> >  	struct orphan_list_zone *opl;
> >  
> >  	SetPageCgroupOrphan(pc);
> 
> here too.
> 
I think PCG_ORPHAN is protected by zone->lru_lock.

Thanks,
Daisuke Nishimura.

> I'm sorry, please give me time. I'd like to post a new version of the soft-limit patches
> next week. I'm sorry for the delay in my work.
> 
> Thanks,
> -Kame
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  7:50                                                   ` Daisuke Nishimura
@ 2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
  2009-04-17  8:12                                                       ` Daisuke Nishimura
  2009-04-17  8:11                                                     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-17  7:58 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 16:50:36 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 17 Apr 2009 15:34:55 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > > and have been testing it these days.
> > > 
> > Good trial! 
> > Honestly, I've written a patch to fix this problem in these days but seems to
> > be over-kill ;)
> > 
> > 
> > > Major changes from your version:
> > > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> > >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> > >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > > 
> > > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > > 
> > > What do you think ?
> > > If it's all right, I'll merge this with the orphan list framework patch
> > > and send it to Andrew with other fixes of memcg that I have.
> > > 
> > I'm sorry but my answer is "please wait". The reason is..
> > 
> > 1. When global LRU works, the pages will be reclaimed.
> > 2. Global LRU will work finally.
> > 3. While testing, "stale" swap cache cannot be big amount.
> > 
> Hmm, I can't understand 2.
> 
> If (memsize on system) >> (swapsize on system), global LRU doesn't run
> and all the swap space can be used up by these SwapCache.
> This means setting mem.limit can use up all the swap space on the system.
> I've tested with 50MB size of swap and it can be used up in less than 24h.
> I think it's not small.
> 

Please add a hook to shrink_zone() to fix this, as you did.
The orphan list is overkill at this stage.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  7:50                                                   ` Daisuke Nishimura
  2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
@ 2009-04-17  8:11                                                     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-17  8:11 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 16:50:36 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > +	opl = orphan_lru(page_to_nid(page), page_zonenum(page));
> > >  	list_del_init(&pc->lru);
> > > +	opl->count--;
> > >  }
> > >  
> > >  static inline void add_orphan_list(struct page *page, struct page_cgroup *pc)
> > >  {
> > > +	int nid = page_to_nid(page);
> > > +	int zid = page_zonenum(page);
> > >  	struct orphan_list_zone *opl;
> > >  
> > >  	SetPageCgroupOrphan(pc);
> > 
> > here too.
> > 
> I think PCG_ORPHAN is protected by zone->lru_lock.
> 

The condition for swap caches is different from that for file caches and anonymous pages.

File cache and anon pages are marked as USED before the first call of add_to_lru,
so the following code in commit_charge_swapin() never breaks page_cgroup->flags for them:

        pc->mem_cgroup = mem;
        smp_wmb();
        pc->flags = pcg_default_flags[ctype];

But a swap cache page can already be on the LRU, with its Orphan bit being set or cleared
under zone->lru_lock, when this code runs; then pc->flags can be broken.

Please notice that

  /* Cache flag is set only once (at allocation) */
  TESTPCGFLAG(Cache, CACHE)

  TESTPCGFLAG(Used, USED)
  CLEARPCGFLAG(Used, USED)

ClearPageCgroupUsed() is the only operation that modifies page_cgroup->flags afterwards,
but it's done under lock_page_cgroup().

If you want to avoid lock_page_cgroup(), please rewrite commit_charge_swapin to do

    SetPageCgroupUsed(pc);
    SetPageCgroupCache(pc);
    ....
    or
    atomic_cmpxchg(&pc->flags, oldval, pcg_default_flags[ctype]);

or something like that.

I'd like to separate the lock bit from the flag bits, but I cannot find a way to do it.
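A very rough sketch of what I mean (illustration only, not the real memcg code; it
assumes pc->flags stays a plain unsigned long, and the function name is made up):
==
/*
 * Sketch: merge the default flags into pc->flags instead of overwriting
 * it, so a concurrent Set/ClearPageCgroupOrphan done under zone->lru_lock
 * is not lost.  pc->mem_cgroup and the smp_wmb() are assumed to have been
 * done just before, as in the current code.
 */
static void commit_swapin_flags_sketch(struct page_cgroup *pc, int ctype)
{
	unsigned long old, new;

	do {
		old = pc->flags;
		new = old | pcg_default_flags[ctype];
	} while (cmpxchg(&pc->flags, old, new) != old);
}
==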


Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
@ 2009-04-17  8:12                                                       ` Daisuke Nishimura
  2009-04-17  8:13                                                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-17  8:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 16:58:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 17 Apr 2009 16:50:36 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Fri, 17 Apr 2009 15:34:55 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > > > and have been testing it these days.
> > > > 
> > > Good trial! 
> > > Honestly, I've written a patch to fix this problem in these days but seems to
> > > be over-kill ;)
> > > 
> > > 
> > > > Major changes from your version:
> > > > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > > > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> > > >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> > > >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > > > 
> > > > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > > > 
> > > > What do you think ?
> > > > If it's all right, I'll merge this with the orphan list framework patch
> > > > and send it to Andrew with other fixes of memcg that I have.
> > > > 
> > > I'm sorry but my answer is "please wait". The reason is..
> > > 
> > > 1. When global LRU works, the pages will be reclaimed.
> > > 2. Global LRU will work finally.
> > > 3. While testing, "stale" swap cache cannot be big amount.
> > > 
> > Hmm, I can't understand 2.
> > 
> > If (memsize on system) >> (swapsize on system), global LRU doesn't run
> > and all the swap space can be used up by these SwapCache.
> > This means setting mem.limit can use up all the swap space on the system.
> > I've tested with 50MB size of swap and it can be used up in less than 24h.
> > I think it's not small.
> > 
> 
> plz add hook to shrink_zone() to fix this as you did. 
> orphan list is overkilling at this stage.
> 
I see.

I'll make a patch, test it, and repost it next week.
It can prevent at least type-2 orphan SwapCache.

I'll revisit the orphan list if needed in the future.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  8:12                                                       ` Daisuke Nishimura
@ 2009-04-17  8:13                                                         ` KAMEZAWA Hiroyuki
  2009-04-21  2:35                                                           ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-17  8:13 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Fri, 17 Apr 2009 17:12:01 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 17 Apr 2009 16:58:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 17 Apr 2009 16:50:36 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > On Fri, 17 Apr 2009 15:34:55 +0900
> > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > > > > and have been testing it these days.
> > > > > 
> > > > Good trial! 
> > > > Honestly, I've written a patch to fix this problem in these days but seems to
> > > > be over-kill ;)
> > > > 
> > > > 
> > > > > Major changes from your version:
> > > > > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > > > > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> > > > >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> > > > >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > > > > 
> > > > > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > > > > 
> > > > > What do you think ?
> > > > > If it's all right, I'll merge this with the orphan list framework patch
> > > > > and send it to Andrew with other fixes of memcg that I have.
> > > > > 
> > > > I'm sorry but my answer is "please wait". The reason is..
> > > > 
> > > > 1. When global LRU works, the pages will be reclaimed.
> > > > 2. Global LRU will work finally.
> > > > 3. While testing, "stale" swap cache cannot be big amount.
> > > > 
> > > Hmm, I can't understand 2.
> > > 
> > > If (memsize on system) >> (swapsize on system), global LRU doesn't run
> > > and all the swap space can be used up by these SwapCache.
> > > This means setting mem.limit can use up all the swap space on the system.
> > > I've tested with 50MB size of swap and it can be used up in less than 24h.
> > > I think it's not small.
> > > 
> > 
> > plz add hook to shrink_zone() to fix this as you did. 
> > orphan list is overkilling at this stage.
> > 
> I see.
> 
> I'll make a patch, test it, and repost it in next week.
> It can prevent at least type-2 of orphan SwapCache.
> 
BTW, does type-1 still exist?

> I'll revisit orphan list if needed in future.
> 
Thank you!

Regards,
-Kame

> 
> Thanks,
> Daisuke Nishimura.
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-17  8:13                                                         ` KAMEZAWA Hiroyuki
@ 2009-04-21  2:35                                                           ` Daisuke Nishimura
  2009-04-21  2:57                                                             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-21  2:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

Sorry for the late reply.

On Fri, 17 Apr 2009 17:13:43 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 17 Apr 2009 17:12:01 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Fri, 17 Apr 2009 16:58:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Fri, 17 Apr 2009 16:50:36 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > 
> > > > On Fri, 17 Apr 2009 15:54:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > On Fri, 17 Apr 2009 15:34:55 +0900
> > > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > > > I made a patch for reclaiming SwapCache from orphan LRU based on your patch,
> > > > > > and have been testing it these days.
> > > > > > 
> > > > > Good trial! 
> > > > > Honestly, I've written a patch to fix this problem in these days but seems to
> > > > > be over-kill ;)
> > > > > 
> > > > > 
> > > > > > Major changes from your version:
> > > > > > - count the number of orphan pages per zone and make the threshold per zone(4MB).
> > > > > > - As for type 2 of orphan SwapCache, they are usually set dirty by add_to_swap.
> > > > > >   But try_to_drop_swapcache(__remove_mapping) can't free dirty pages,
> > > > > >   so add a check and try_to_free_swap to the end of shrink_page_list.
> > > > > > 
> > > > > > It seems work fine, no "pseud leak" of SwapCache can be seen.
> > > > > > 
> > > > > > What do you think ?
> > > > > > If it's all right, I'll merge this with the orphan list framework patch
> > > > > > and send it to Andrew with other fixes of memcg that I have.
> > > > > > 
> > > > > I'm sorry but my answer is "please wait". The reason is..
> > > > > 
> > > > > 1. When global LRU works, the pages will be reclaimed.
> > > > > 2. Global LRU will work finally.
> > > > > 3. While testing, "stale" swap cache cannot be big amount.
> > > > > 
> > > > Hmm, I can't understand 2.
> > > > 
> > > > If (memsize on system) >> (swapsize on system), global LRU doesn't run
> > > > and all the swap space can be used up by these SwapCache.
> > > > This means setting mem.limit can use up all the swap space on the system.
> > > > I've tested with 50MB size of swap and it can be used up in less than 24h.
> > > > I think it's not small.
> > > > 
> > > 
> > > plz add hook to shrink_zone() to fix this as you did. 
> > > orphan list is overkilling at this stage.
> > > 
> > I see.
> > 
> > I'll make a patch, test it, and repost it in next week.
> > It can prevent at least type-2 of orphan SwapCache.
> > 
> BTW, type-1 still exits ?
> 
Ah, I meant adding to shrink_page_list():

@@ -785,6 +786,23 @@ activate_locked:
                SetPageActive(page);
                pgactivate++;
 keep_locked:
+               if (!scanning_global_lru(sc) && PageSwapCache(page)) {
+                       struct page_cgroup *pc;
+
+                       pc = lookup_page_cgroup(page);
+                       /*
+                        * Used bit of swapcache is solid under page lock.
+                        */
+                       if (unlikely(!PageCgroupUsed(pc)))
+                               /*
+                                * This can happen if the page is free'ed by
+                                * the owner process before it is added to
+                                * swapcache.
+                                * These swapcache cannot be managed by memcg
+                                * well, so free it here.
+                                */
+                               try_to_free_swap(page);
+               }
                unlock_page(page);
 keep:
                list_add(&page->lru, &ret_pages);

This cannot prevent type-1 orphan SwapCache (caused by the race
between exit() and swap-in readahead).
Type-1 can pressure the memsw usage (and trigger OOM as a result, if memsw.limit is set)
and make struct mem_cgroup unfreeable even after rmdir (because it holds a refcount
to the mem_cgroup).

Do you have any ideas for solving the orphan SwapCache problem by adding some hooks to shrink_zone()?
(scan some pages from the global LRU and check whether each is an orphan SwapCache page,
with some code like the above?)

And what do you think about adding the above code to shrink_page_list()?
I think it might be unnecessary if we can solve the problem in another way, though.
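To make the shrink_zone() idea above a bit more concrete, something like the rough
sketch below is what I imagine (drop_orphan_swapcache() and how/when shrink_zone()
isolates the batch are assumptions; the page/page_cgroup helpers are the existing ones):
==
/*
 * Sketch: walk a batch of pages that shrink_zone() has already isolated
 * from the global LRU and free any swap cache nobody is charged for.
 * The pages stay on "page_list"; the caller puts them back afterwards.
 */
static void drop_orphan_swapcache(struct list_head *page_list)
{
	struct page *page;

	list_for_each_entry(page, page_list, lru) {
		struct page_cgroup *pc;

		if (!PageSwapCache(page) || !trylock_page(page))
			continue;
		pc = lookup_page_cgroup(page);
		/* Used bit of swapcache is stable under page lock */
		if (unlikely(!PageCgroupUsed(pc)))
			try_to_free_swap(page);
		unlock_page(page);
	}
}
==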


Thanks,
Daisuke Nishimura.

> > I'll revisit orphan list if needed in future.
> > 
> Thank you!.
> 
> Regards,
> -Kame
> 
> > 
> > Thanks,
> > Daisuke Nishimura.
> > 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-21  2:35                                                           ` Daisuke Nishimura
@ 2009-04-21  2:57                                                             ` KAMEZAWA Hiroyuki
  2009-04-21  4:05                                                               ` Daisuke Nishimura
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-21  2:57 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 21 Apr 2009 11:35:25 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> @@ -785,6 +786,23 @@ activate_locked:
>                 SetPageActive(page);
>                 pgactivate++;
>  keep_locked:
> +               if (!scanning_global_lru(sc) && PageSwapCache(page)) {
> +                       struct page_cgroup *pc;
> +
> +                       pc = lookup_page_cgroup(page);
> +                       /*
> +                        * Used bit of swapcache is solid under page lock.
> +                        */
> +                       if (unlikely(!PageCgroupUsed(pc)))
> +                               /*
> +                                * This can happen if the page is free'ed by
> +                                * the owner process before it is added to
> +                                * swapcache.
> +                                * These swapcache cannot be managed by memcg
> +                                * well, so free it here.
> +                                */
> +                               try_to_free_swap(page);
> +               }
>                 unlock_page(page);
>  keep:
>                 list_add(&page->lru, &ret_pages);
> 
> This cannot prevent type-1 orphan SwapCache(caused by the race
> between exit() and swap-in readahead).
> Type-1 can pressure the memsw usage(trigger OOM if memsw.limit is set, as a result)
> and make struct mem_cgroup unfreeable even after rmdir(because it holds refcount
> to mem_cgroup).
Hmm.
   free_swap_and_cache()
	-> trylock_page() => failure case?

Add the following code:
==
        page = find_get_page(&swapper_space, entry.val);
        if (page && !trylock_page(page)) {
                mem_cgroup_retry_free_swap_lazy(page);  <=====
                page_cache_release(page);
                page = NULL;
        }
==
and do some kind of lazy op... I'll try something.
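Just so the idea is clearer, a very rough sketch of the lazy part (everything below is
guesswork, not an interface proposal; I pass the swap entry instead of the page so no
page reference has to be kept across the delay):
==
/*
 * Sketch: remember the entry whose page could not be locked and retry
 * the free later from a delayed work.  One slot is enough to show the
 * idea; a real version needs a small queue and proper locking.
 */
static swp_entry_t pending_entry;

static void retry_free_swapcache(struct work_struct *work)
{
	struct page *page = find_get_page(&swapper_space, pending_entry.val);

	if (!page)
		return;
	if (trylock_page(page)) {
		try_to_free_swap(page);
		unlock_page(page);
	}
	page_cache_release(page);
}
static DECLARE_DELAYED_WORK(retry_work, retry_free_swapcache);

static void mem_cgroup_retry_free_swap_lazy(swp_entry_t entry)
{
	pending_entry = entry;
	schedule_delayed_work(&retry_work, HZ);
}
==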

> 
> Do you have any ideas to solve orphan SwapCache problem by adding some hooks to shrink_zone() ?
> (scan some pages from global LRU and check whether it's orphan SwapCache or not by
> adding some code like above ?)
> 
> And, what do you think about adding above code to shrink_page_list() ?
> I think it might be unnecessary if we can solve the problem in another way, though.
> 

I think your hook itself is not so bad (even if we remove it later).

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] fix unused/stale swap cache handling on memcg  v3
  2009-04-21  2:57                                                             ` KAMEZAWA Hiroyuki
@ 2009-04-21  4:05                                                               ` Daisuke Nishimura
  0 siblings, 0 replies; 36+ messages in thread
From: Daisuke Nishimura @ 2009-04-21  4:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, Daisuke Nishimura, linux-mm, Balbir Singh, Hugh Dickins

On Tue, 21 Apr 2009 11:57:49 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 21 Apr 2009 11:35:25 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > @@ -785,6 +786,23 @@ activate_locked:
> >                 SetPageActive(page);
> >                 pgactivate++;
> >  keep_locked:
> > +               if (!scanning_global_lru(sc) && PageSwapCache(page)) {
> > +                       struct page_cgroup *pc;
> > +
> > +                       pc = lookup_page_cgroup(page);
> > +                       /*
> > +                        * Used bit of swapcache is solid under page lock.
> > +                        */
> > +                       if (unlikely(!PageCgroupUsed(pc)))
> > +                               /*
> > +                                * This can happen if the page is free'ed by
> > +                                * the owner process before it is added to
> > +                                * swapcache.
> > +                                * These swapcache cannot be managed by memcg
> > +                                * well, so free it here.
> > +                                */
> > +                               try_to_free_swap(page);
> > +               }
> >                 unlock_page(page);
> >  keep:
> >                 list_add(&page->lru, &ret_pages);
> > 
> > This cannot prevent type-1 orphan SwapCache(caused by the race
> > between exit() and swap-in readahead).
> > Type-1 can pressure the memsw usage(trigger OOM if memsw.limit is set, as a result)
> > and make struct mem_cgroup unfreeable even after rmdir(because it holds refcount
> > to mem_cgroup).
> Hmm.
>    free_swap_cache()
> 	-> trylock_page() => failure case ?
> 
Yes, but there is another case:

            processA                   |           processB
  -------------------------------------+-------------------------------------
    (free_swap_and_cache())            |  (read_swap_cache_async())
                                       |    swap_duplicate()
      swap_entry_free() == 1           |
      find_get_page() -> cannot find   |
                                       |    __set_page_locked()
                                       |    add_to_swap_cache()
                                       |    lru_cache_add_anon()
                                       |      doesn't link this page to memcg's
                                       |      LRU, because of !PageCgroupUsed.


> add following codes.
> ==
>  588                         page = find_get_page(&swapper_space, entry.val);
>  589                         if (page && !trylock_page(page)) {
> 				     mem_cgroup_retry_free_swap_lazy(page);  <=====
>  590                                 page_cache_release(page);
>  591                                 page = NULL;
>  592                         }
> ==
> and  do some kind of lazy ops..I'll try some.
> 
> > 
> > Do you have any ideas to solve orphan SwapCache problem by adding some hooks to shrink_zone() ?
> > (scan some pages from global LRU and check whether it's orphan SwapCache or not by
> > adding some code like above ?)
> > 
> > And, what do you think about adding above code to shrink_page_list() ?
> > I think it might be unnecessary if we can solve the problem in another way, though.
> > 
> 
> I think your hook itself is not very bad. (even if we remove this later..)
> 
I think whether we should remove this or not depends on how we fix type-1.
Anyway, I'll leave it as it is for now.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2009-04-21  4:09 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-17  4:57 [RFC] memcg: handle swapcache leak Daisuke Nishimura
2009-03-17  5:39 ` KAMEZAWA Hiroyuki
2009-03-17  6:11   ` Daisuke Nishimura
2009-03-17  7:29     ` KAMEZAWA Hiroyuki
2009-03-17  9:38       ` KAMEZAWA Hiroyuki
2009-03-18  1:17         ` Daisuke Nishimura
2009-03-18  1:34           ` KAMEZAWA Hiroyuki
2009-03-18  3:51             ` Daisuke Nishimura
2009-03-18  4:05               ` KAMEZAWA Hiroyuki
2009-03-18  8:57               ` [PATCH] fix unused/stale swap cache handling on memcg v1 (Re: " KAMEZAWA Hiroyuki
2009-03-18 14:17                 ` Daisuke Nishimura
2009-03-18 23:45                   ` KAMEZAWA Hiroyuki
2009-03-19  2:16                     ` KAMEZAWA Hiroyuki
2009-03-19  9:06                       ` [PATCH] fix unused/stale swap cache handling on memcg v2 KAMEZAWA Hiroyuki
2009-03-19 10:01                         ` Daisuke Nishimura
2009-03-19 10:13                           ` Daisuke Nishimura
2009-03-19 10:46                             ` KAMEZAWA Hiroyuki
2009-03-19 11:36                               ` KAMEZAWA Hiroyuki
2009-03-20  7:45                                 ` [PATCH] fix unused/stale swap cache handling on memcg v3 KAMEZAWA Hiroyuki
2009-03-23  1:45                                   ` Daisuke Nishimura
2009-03-23  2:41                                     ` KAMEZAWA Hiroyuki
2009-03-23  5:04                                       ` Daisuke Nishimura
2009-03-23  5:22                                         ` KAMEZAWA Hiroyuki
2009-03-24  8:32                                           ` Daisuke Nishimura
2009-03-24 23:57                                             ` KAMEZAWA Hiroyuki
2009-04-17  6:34                                               ` Daisuke Nishimura
2009-04-17  6:54                                                 ` KAMEZAWA Hiroyuki
2009-04-17  7:50                                                   ` Daisuke Nishimura
2009-04-17  7:58                                                     ` KAMEZAWA Hiroyuki
2009-04-17  8:12                                                       ` Daisuke Nishimura
2009-04-17  8:13                                                         ` KAMEZAWA Hiroyuki
2009-04-21  2:35                                                           ` Daisuke Nishimura
2009-04-21  2:57                                                             ` KAMEZAWA Hiroyuki
2009-04-21  4:05                                                               ` Daisuke Nishimura
2009-04-17  8:11                                                     ` KAMEZAWA Hiroyuki
2009-03-18  0:08       ` [RFC] memcg: handle swapcache leak Daisuke Nishimura
